Preface


Econometric  models  and  methods  are  applied  in 
the  daily  practice  of  virtually  all  disciplines  in 
business  and  economics  like  finance,  marketing, 
microeconomics,  and  macroeconomics.  This 
book  is  meant  for  anyone  interested  in  obtaining 
a solid understanding and active working knowledge
of this field. The book provides the reader
both with the required insight into econometric
methods  and  with  the  practical  training  needed 
for  successful  applications.  The  guiding 
principle  of  the  book  is  to  stimulate  the  reader 
to  work  actively  on  examples  and  exercises,  so 
that  econometrics  is  learnt  the  way  it  works  in 
practice  —  that  is,  practical  methods  for  solving 
questions  in  business  and  economics,  based  on  a 
solid  understanding  of  the  underlying  methods. 
In  this  way  the  reader  gets  trained  to  make  the 
proper  decisions  in  econometric  modelling. 

This book has grown out of half a century of
experience in teaching undergraduate econometrics
at the Econometric Institute in Rotterdam.
With the support of Jan Tinbergen, Henri Theil
founded the institute in 1956 and developed
Econometrics into a full-blown academic programme.
Originally, econometrics was mostly
concerned with national and international
macroeconomic policy; the computing power required
to estimate econometric models was expensive
and scarce, so that econometrics was
almost exclusively applied in public (statistical)
agencies. Much has changed, and nowadays
econometrics finds widespread application in a
rich variety of fields. The two major causes
of this increased role of econometrics are the
information explosion in business and economics
(with large data sets, for instance in finance
and marketing) and the enormous growth in
cheap computing power and user-friendly software
for a wide range of econometric methods.

This development is reflected in the book, as it
presents econometric methods as a collection of
very useful tools to address issues in a wide range
of application areas. First of all, students should
learn the essentials of econometrics in a rigorous
way, as this forms the indispensable basis for all
valid practical work. These essentials are treated
in Chapters 1-5, after which two major application
areas are discussed in Chapter 6 (on individual
choice data with applications in marketing
and microeconomics) and Chapter 7 (on time
series data with applications in finance and
international economics). The Introduction provides
more information on the motivation and contents
of the book, together with advice for students
and instructors, and the Guide to the Book
explains the structure and use of the book.

We  thank  our  students,  who  always  stimulate 
our  enthusiasm  to  teach  and  who  make  us  feel 
proud  by  their  achievements  in  their  later  careers 
in econometrics, economics, and business management.
We also thank both current and
former  members  of  the  Econometric  Institute  in 
Rotterdam  who  have  inspired  our  econometric 
work. 

Several people helped us in the process of
writing the book and the solutions manual.
First of all we should mention our colleague
Zsolt Sandor and our (current and former) Ph.D.
students Charles Bos, Lennart Hoogerheide,
Rutger van Oest, and Bjorn Vroomen, who all
contributed substantially to producing the solutions
manual. Further we thank our (current
and former) colleagues at the Econometric Institute,
Bas Donkers, Rinse Harkema, Johan
Kaashoek, Frank Kleibergen, Richard Kleijn,
Peter Kooiman, Marius Ooms, and Peter
Schotman. We were assisted by our (former)
students Arjan van Dijk, Alex Hoogendoorn,
and Jesse de Klerk, and we obtained very helpful
feedback from our students, in particular from
Simone Jansen, Martijn de Jong, Marielle Non,
Arnoud Pijls, and Gerard Voskuil. Special
thanks go to Aletta Henderiks, who never lost
her courage in giving us the necessary secretarial
support in processing the manuscript. Finally we
wish to thank the delegates and staff of Oxford
University Press for their assistance, in particular
Andrew Schuller, Arthur Attwell, and Hilary
Walford.

Christiaan  Heij,  Paul  de  Boer,  Philip  Hans 
Franses,  Teun  Kloek,  Herman  K.  van  Dijk 

Rotterdam,  2004 

Guide to the Book


This  guide  describes  the  organization  and  use  of  the  book.  We  refer  to  the 
Introduction  for  the  purpose  of  the  book,  for  a  synopsis  of  the  contents  of  the 
book,  for  study  advice,  and  for  suggestions  for  instructors  as  to  how  the  book 
can  be  used  in  different  courses. 


Learning  econometrics:  Why,  what,  and  how 

The  learning  student  is  confronted  with  three  basic  questions:  Why  should 
I  study  this?  What  knowledge  do  I  need?  How  can  I  apply  this  knowledge 
in  practice?  Therefore  the  topics  of  the  book  are  presented  in  the  following 
manner: 

•  explanation  by  motivating  examples; 

•  discussion  of  appropriate  econometric  models  and  methods; 

•  illustrative  applications  in  practical  examples; 

•  training  by  empirical  exercises  (using  an  econometric  software  package); 

•  optional  deeper  understanding  (theory  text  parts  and  theory  and  simulation 
exercises). 

The book can be used for applied courses that focus on the ‘how’ of econometrics
and also for more advanced courses that treat both the ‘how’ and the ‘what’
of econometrics. The user is free to choose the desired balance between
econometric applications and econometric theory.

•  In  applied  courses,  the  theory  parts  (clearly  marked  in  the  text)  and  the  theory 
and  simulation  exercises  can  be  skipped  without  any  harm.  Even  without 
these  parts,  the  text  still  provides  a  good  understanding  of  the  ‘what’  of 
econometrics  that  is  required  in  sound  applied  work,  as  there  exist  no  standard 
‘how-to-do’  recipes  that  can  be  applied  blindly  in  practice. 

• In more advanced courses, students get a deeper understanding of econometrics
(in addition to the practical skills of applied courses) by also studying
the theory parts and by doing the theory and simulation exercises. This
allows them to apply econometrics in new situations that require a creative
mind in developing alternative models and methods.

Text structure

The  required  background  material  is  covered  in  Chapter  1  (which  reviews 
statistical  methods  that  are  fundamental  in  econometrics)  and  in  Appendix  A 
(which  summarizes  useful  matrix  methods,  together  with  computational 
examples). The core material on econometrics is in Chapters 2-7; Chapters 2-5
treat fundamental econometric methods that are needed for the topics
discussed in Chapters 6 and 7. Each chapter has the following structure.

•  The  chapter  starts  with  a  brief  statement  of  the  purpose  of  the  chapter, 
followed  by  sections  and  subsections  that  are  divided  into  manageable  parts 
with  clear  headings. 

•  Examples,  theory  parts,  and  computational  schemes  are  clearly  indicated  in 
the  text. 

•  Summaries  are  included  at  many  points  —  especially  at  the  end  of  all  sections 
in  Chapters  5-7. 

•  The  chapter  concludes  with  a  brief  summary,  further  reading,  and  a  keyword 
list  that  summarizes  the  treated  topics. 

•  A  varied  set  of  exercises  is  included  at  the  end  of  each  chapter. 

To  facilitate  the  use  of  the  book,  the  required  preliminary  knowledge  is  indicated 
at  the  start  of  subsections. 

•  In  Chapters  2-4  we  refer  to  the  preliminary  knowledge  needed  from  Chapter 
1 (on statistics) and Appendix A (on matrix methods). Therefore, it is not
necessary to cover all of Chapter 1 before starting on the other chapters, as
Chapter  1  can  be  reviewed  along  the  way  as  one  progresses  through  Chapters 
2-4,  and  the  same  holds  true  for  the  material  of  Appendix  A. 

•  In  Chapters  6  and  7  we  indicate  which  parts  of  the  earlier  chapters  are  needed 
at  each  stage.  Most  of  the  sections  of  Chapter  5  can  be  read  independently  of 
each  other,  and  in  Chapters  6  and  7  some  sections  can  be  skipped  depending 
on  the  topics  of  interest  for  the  reader. 

•  Further  details  of  the  text  structure  are  discussed  in  the  Introduction  (see  the 
section  ‘Teaching  suggestions’  —  in  particular,  Exhibit  0.3). 


Examples  and  data  sets 

The  econometric  models  and  methods  are  motivated  by  means  of  fully 
worked-out  examples  using  real-world  data  sets  from  a  variety  of  applications 
in  business  and  economics.  The  examples  are  clearly  marked  in  the  text  because 
they  play  a  crucial  role  in  explaining  the  application  of  econometric  methods. 
The corresponding data sets are available from the web site of the book, and
Appendix  B  explains  the  type  and  source  of  the  data  and  the  meaning  of  the 
variables  in  the  data  files  (see  p.  748  for  a  list  of  all  the  data  sets  used  in 
the  book).  The  names  of  the  data  sets  consist  of  three  parts: 

•  XM  (for  examples)  and  XR  (for  exercises); 

•  three  digits,  indicating  the  example  or  exercise  number; 

•  three  letters,  indicating  the  data  topic. 

For  example,  the  file  XM101STU  contains  the  data  for  Example  1.1  on  student 
learning,  and  the  file  XR111STU  contains  the  data  for  Exercise  1.11  on 
student  learning. 
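
The naming convention can be made concrete with a small parser. The sketch below is illustrative only: the function and type names are hypothetical, and the reading of the three digits as one chapter digit followed by a two-digit example or exercise number is an assumption inferred from the two file names quoted above, not a rule stated explicitly in the text.

```python
import re
from typing import NamedTuple


class DataSetName(NamedTuple):
    """Parts of a data set name such as XM101STU (hypothetical helper type)."""
    kind: str      # "example" (prefix XM) or "exercise" (prefix XR)
    chapter: int   # assumed: first of the three digits
    number: int    # assumed: last two of the three digits
    topic: str     # three-letter data topic, e.g. "STU"


# Prefix XM or XR, three digits, three letters.
_NAME_PATTERN = re.compile(r"^(XM|XR)(\d)(\d{2})([A-Z]{3})$")


def parse_data_set_name(name: str) -> DataSetName:
    """Split a data set name into its three parts (with the digits split in two)."""
    match = _NAME_PATTERN.match(name.strip().upper())
    if match is None:
        raise ValueError(f"not a recognised data set name: {name!r}")
    prefix, chapter, number, topic = match.groups()
    kind = "example" if prefix == "XM" else "exercise"
    return DataSetName(kind, int(chapter), int(number), topic)


if __name__ == "__main__":
    for file_name in ("XM101STU", "XR111STU"):
        print(file_name, parse_data_set_name(file_name))
```

Under these assumptions, `parse_data_set_name("XR111STU")` describes an exercise in Chapter 1 with number 11 and topic STU, which matches Exercise 1.11 on student learning.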


Exercises 

Students will enhance their understanding and acquire practical skills by
working through the exercises, which are of three types.

•  Theory  exercises  on  derivations  and  model  extensions.  These  exercises  deepen 
the  theoretical  understanding  of  the  ‘what’  of  econometrics.  The  desired  level 
of  the  course  will  determine  how  many  of  the  theory  exercises  should  be 
covered. 

•  Simulation  exercises  illustrating  statistical  properties  of  econometric  models 
and  methods.  These  exercises  provide  more  intuitive  understanding  of  some  of 
the  central  theoretical  results. 

•  Empirical  exercises  on  applications  with  business  and  economic  data  sets  to 
solve  questions  of  practical  interest.  These  exercises  focus  on  the  ‘how’  of 
econometrics,  so  that  the  student  learns  to  construct  appropriate  models  from 
real-world  data  and  to  draw  sound  conclusions  from  the  obtained  results. 
Actively  working  through  these  empirical  exercises  is  essential  to  gaining  a 
proper  understanding  of  econometrics  and  to  getting  hands-on  experience 
with  applications  to  solve  practical  problems.  The  web  site  of  the  book 
contains  the  data  sets  of  all  empirical  exercises,  and  Appendix  B  contains 
information  on  these  data  sets. 

The  choice  of  appropriate  exercises  is  facilitated  by  cross-references. 

•  Each  subsection  concludes  with  a  list  of  exercises  related  to  the  material  of 
that  subsection  (where  T  denotes  theory  exercises,  S  simulation  exercises,  and 
E  empirical  exercises). 

•  Every  exercise  refers  to  the  parts  of  the  chapter  that  are  needed  for  doing  the 
exercise. 


•  An  asterisk  (*)  denotes  advanced  (parts  of)  exercises. 

Web site and software

The  web  site  of  the  book  contains  all  the  data  sets  used  in  the  book,  in  three 
formats: 

•  EViews; 

•  Excel; 

•  ASCII. 

All  the  examples  and  all  the  empirical  and  simulation  exercises  in  the  book  can 
be  done  with  EViews  version  3.1  and  higher  (Quantitative  Micro  Software, 
1994-8),  but  other  econometric  software  packages  can  also  be  used  in  most 
cases.  The  student  version  of  the  EViews  package  suffices  for  most  of  the  book, 
but  this  version  has  some  limitations  —  for  example,  it  does  not  support  the 
programs  required  for  the  simulation  exercises  (see  the  web  site  of  the  book  for 
further  details).  The  exhibits  for  the  empirical  examples  in  the  text  have  been 
obtained  by  using  EViews  version  3.1. 


Instructor  material 

Instructors  who  adopt  the  book  can  receive  the  Solutions  Manual  of  the  book  for 
free. 

•  The  manual  contains  over  350  pages  with  fully  worked-out  text  solutions  of 
all  exercises,  both  of  the  theory  questions  and  of  the  empirical  and  simulation 
questions;  this  will  assist  instructors  in  selecting  material  for  exercise  sessions 
and  computer  sessions  as  part  of  their  course. 

•  The  manual  contains  a  CD-ROM  with  solution  files  (EViews  work  files  with 
the  solutions  of  all  empirical  exercises  and  EViews  programs  for  all  simulation 
exercises). 

•  This  CD-ROM  also  contains  all  the  exhibits  of  the  book  (in  Word  format)  to 
facilitate  lecture  presentations. 

The  printed  solutions  manual  and  CD-ROM  can  be  obtained  from  Oxford 
University  Press,  upon  request  by  adopting  instructors.  For  further  information 
and  additional  material  we  refer  readers  to  the  Oxford  University  Press  web  site 
of  the  book. 

Remarks  on  notation 

In  the  text  we  follow  the  notational  conventions  commonly  used  in  econometrics. 

• Scalar variables and vectors are denoted by lower-case italic letters ($x$, $y$, and
so on); however, in Section 7.6 vectors of variables are denoted by upper-case
italic letters, such as $Y_t$, in accordance with most of the literature on this topic.

• Matrices are denoted by upper-case italic letters ($X$, $A$, and so on).

• The element in row $i$ and column $j$ of a matrix $A$ is generally denoted by $a_{ij}$,
except for the regressor matrix $X$, where this element is denoted by $x_{ji}$, which
is observation $i$ of variable $j$ (see Section 3.1.2).

• $x_i$ denotes the vector containing the values of all the explanatory variables $x_{ji}$
for observation $i$ (including the value 1 as first element of $x_i$ if the model
contains a constant term).

• Transposition is denoted by a prime ($X'$, $x'$, and so on).

• Unknown parameters are denoted by Greek italic letters ($\beta$, $\varepsilon$, $\sigma$, and so on).

• Estimated quantities are denoted by Latin italic letters ($b$, $e$, $s$, and so on), or
sometimes by imposing a hat ($\hat{\beta}$, $\hat{\varepsilon}$, $\hat{\sigma}$, and so on).

• Expected values are denoted by $E[\,\cdot\,]$ (for instance, $E[b]$).

• $\log(x)$ denotes the natural logarithm of $x$ (with base $e = 2.71828\ldots$).
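
As a compact illustration of how these conventions combine, the linear regression model and its least squares quantities can be written as follows. This is a sketch in standard textbook notation, not a formula quoted from a particular section of the book: Greek letters stand for the unknown parameter vector and disturbances, Latin letters for the estimates and residuals, a prime for transposition, and the last equality holds only under the usual regression assumptions (for example, exogenous regressors).

```latex
y_i = x_i'\beta + \varepsilon_i \quad (i = 1, \ldots, n),
\qquad
b = (X'X)^{-1} X'y,
\qquad
e = y - Xb,
\qquad
E[b] = \beta .
```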

In many of the exhibits (for instance, the ones related to empirical examples)
we show the output as generated by the software program EViews. The notation
in these exhibits may differ from the above conventions.

• Scalar variables are denoted by capital letters (X, Y, instead of x, y, and so on).

• Statistics are denoted by text (R-squared, Std. Dev., instead of $R^2$, s, and so
on).

In most cases this does not lead to any confusion, and otherwise the notation is
explained in the text or in the caption of the exhibits.

Econometrics

Decision  making  in  business  and  economics  is  often  supported  by  the  use  of 
quantitative information. Econometrics is concerned with summarizing relevant
data information by means of a model. Such econometric models help to
understand  the  relation  between  economic  and  business  variables  and  to 
analyse  the  possible  effects  of  decisions. 

Econometrics  was  founded  as  a  scientific  discipline  around  1930.  In 
the  early  years,  most  applications  dealt  with  macroeconomic  questions 
to  help  governments  and  large  firms  in  making  their  long-term  decisions. 
Nowadays  econometrics  forms  an  indispensable  tool  to  model  empirical 
reality  in  almost  all  economic  and  business  disciplines.  There  are  three 
major  reasons  for  this  increasing  attention  for  factual  data  and  econometric 
models. 

•  Economic  theory  often  does  not  give  the  quantitative  information  that  is 
needed  in  practical  decision  making. 

•  Relevant  quantitative  data  are  available  in  many  economic  and  business 
disciplines. 

•  Realistic  models  can  easily  be  solved  by  modern  econometric  techniques  to 
support  everyday  decisions  of  economists  and  business  managers. 

In areas such as finance and marketing, quantitative data (on price movements,
sales patterns, and so on) are collected on a regular basis, weekly,
daily,  or  even  every  split  second.  Much  information  is  also  available  in 
microeconomics  (for  instance,  on  the  spending  behaviour  of  households). 
Econometric  techniques  have  been  developed  to  deal  with  all  such  kinds  of 
information. 

Econometrics  is  an  interdisciplinary  field.  It  uses  insights  from  economics 
and business in selecting the relevant variables and models, it uses computer-science
methods to collect the data and to solve econometric models, and it
uses  statistics  and  mathematics  to  develop  econometric  methods  that  are 
appropriate  for  the  data  and  the  problem  at  hand.  The  interplay  of  these 
disciplines  in  econometric  modelling  is  summarized  in  Exhibit  0.1. 

Exhibit 0.1 Econometrics as an interdisciplinary field


Purpose  of  the  book 

The book gives the student a sound introduction to modern econometrics.
The  student  obtains  a  solid  understanding  of  econometric  methods  and  an 
active  training  in  econometrics  as  it  is  applied  in  practice.  This  involves  the 
following  steps. 

1.  Question.  Formulate  the  economic  and  business  questions  of  central 
interest. 

2.  Information.  Collect  and  analyse  relevant  statistical  data. 

3.  Model.  Formulate  and  estimate  an  appropriate  econometric  model. 

4.  Analysis.  Analyse  the  empirical  validity  of  the  model. 

5.  Application.  Apply  the  model  to  answer  the  questions  and  to  support 
decisions. 

These  steps  are  shown  in  Exhibit  0.2.  Steps  1,  2,  and  5  form  the  applied  part 
of  econometrics  and  steps  3  and  4  the  theoretical  part.  Although  econometric 
models  and  methods  differ  according  to  the  nature  of  the  data  and  the  type  of 
questions  under  investigation,  all  applications  share  this  common  structure. 
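
The five steps can be traced in a toy calculation. The following Python sketch is an illustration under stated assumptions, not an example taken from the book: the question and the numbers are made up, and the function name `fit_simple_regression` is introduced here only for this purpose. It fits a simple regression by least squares (step 3), inspects the residuals and R-squared (step 4), and uses the fitted line for a prediction (step 5).

```python
def fit_simple_regression(x, y):
    """Least squares fit of y = a + b*x.

    Returns (a, b, residuals, r_squared). Requires some variation in
    both x and y, otherwise a division by zero occurs.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = s_xy / s_xx                      # slope estimate
    a = y_bar - b * x_bar                # intercept estimate
    residuals = [yi - a - b * xi for xi, yi in zip(x, y)]
    ss_res = sum(e ** 2 for e in residuals)
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    r_squared = 1.0 - ss_res / ss_tot
    return a, b, residuals, r_squared


if __name__ == "__main__":
    # Step 1 (question): how do sales respond to advertising? (hypothetical)
    # Step 2 (information): made-up observations, for illustration only.
    advertising = [1.0, 2.0, 3.0, 4.0, 5.0]
    sales = [2.9, 5.2, 6.8, 9.1, 11.0]

    # Step 3 (model): formulate and estimate a linear regression.
    a, b, residuals, r_squared = fit_simple_regression(advertising, sales)

    # Step 4 (analysis): a first, very crude look at the fit.
    print("intercept:", a, "slope:", b)
    print("R-squared:", r_squared)
    print("largest absolute residual:", max(abs(e) for e in residuals))

    # Step 5 (application): use the model to answer the question.
    print("predicted sales at advertising = 6:", a + b * 6.0)
```

In real applications step 4 involves far more than R-squared; the sketch only marks where diagnostic analysis enters the workflow.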

As  the  title  of  the  book  indicates,  it  discusses  econometric  methods  (tools  for 
the  formulation,  estimation,  and  diagnostic  analysis  of  econometric  models) 
that  are  motivated  and  illustrated  by  applications  in  business  and  economics 
(to  answer  practical  questions  that  support  decisions  by  means  of  relevant 
quantitative  data  information).  The  book  provides  a  rigorous  and  self- 
contained  treatment  of  the  central  methods  in  econometrics  in  Chapters  1-5. 
This  provides  the  student  with  a  thorough  understanding  of  the  central  ideas 
and their practical application. Two major application areas are discussed in
more detail: models for individual economic behaviour (with applications
in marketing and microeconomics) in Chapter 6 and models for time
series data (with applications in finance and macroeconomics) in Chapter 7.

The book is selective, as its purpose is not to give an exhaustive encyclopaedic
overview of all available methods. The thorough treatment of the
selected topics not only enables the student to apply these methods successfully
in practice; it also gives an excellent preparation for understanding and
applying econometrics in other application areas.

Exhibit 0.2 Econometric modelling


Characteristic  features  of  the  book 

Over recent years several new and refreshing econometric textbooks have
appeared. Our book is characterized by its thorough discussion of core econometrics
motivated and illustrated by real-world examples from a broad range
of economic and business applications. In all our discussions of econometric
topics we stress the interplay between real-world applications and the practical
need for econometric models and methods. This twofold serious attention
to methods and applications is also reflected in the extensive exercise
sections at the end of each chapter, which contain both theory questions and
empirical questions. Some characteristic features of the book follow.

•  The  book  is  of  an  academic  level  and  it  is  rigorous  and  self-contained. 
Preliminary  topics  in  statistics  are  reviewed  in  Chapter  1,  and  required 
matrix  methods  are  summarized  in  Appendix  A. 

•  The  book  gives  a  sound  and  solid  training  in  basic  econometric  thinking 
and  working  in  Chapters  1-5,  the  basis  of  all  econometric  work. 

• The book presents deep coverage of key econometric topics rather
than exhaustive coverage of all topics. Two major application areas are
discussed in detail, namely choice data (in marketing and
microeconomics) in Chapter 6 and time series data (in finance and
international economics) in Chapter 7.

•  All  topics  are  treated  thoroughly  and  are  illustrated  with  up-to-date  real- 
world  applications  to  solve  practical  economic  and  business  questions. 

• The book stimulates active learning by the examples, which show
econometrics as it works in practice, and by extensive exercise sets. The
theory and simulation exercises provide a deeper understanding, and the
empirical examples provide the student with a working understanding and
hands-on experience with econometrics in a broad set of real-world economic
and business data sets.

• The book can be used both in more advanced (graduate) courses and in
introductory applied (undergraduate) courses, because the more theoretical
parts can easily be skipped without loss of coherence of the exposition.

•  The  book  supports  the  learning  process  in  many  ways  (see  the  Guide  to  the 
Book  for  further  details). 


Target  audience  and  required  background  knowledge 

As  stated  in  the  Preface,  the  book  is  directed  at  anyone  interested  in  obtaining 
a  solid  understanding  and  active  working  knowledge  of  econometrics  as  it 
works  in  the  daily  practice  of  business  and  economics. 

The  book  builds  up  econometrics  from  its  fundamentals  in  simple  models 
to modern applied research. It does not require any prior course in econometrics.
The book assumes a good working knowledge of basic statistics and
some  knowledge  of  matrix  algebra.  An  overview  of  the  required  statistical 
concepts  and  methods  is  given  in  Chapter  1,  which  is  meant  as  a  refresher 
and  which  requires  a  preliminary  course  in  statistics.  The  required  matrix 
methods  are  summarized  in  Appendix  A. 

Brief  contents  of  the  book 

The  contents  can  be  split  into  four  parts:  Chapter  1  (review  of  statistics), 
Chapters  2-4  (model  building),  Chapter  5  (model  evaluation),  and  Chapters 
6  and  7  (selected  application  areas). 

Chapter  1  reviews  the  statistical  material  needed  in  later  chapters.  It  serves 
as  a  refresher  for  students  with  some  background  in  statistics.  The  chapter 
discusses  the  concepts  of  random  variables  and  probability  distributions  and 
methods  of  estimation  and  testing. 

Basic  econometric  methods  are  described  in  Chapters  2-4.  Chapters  2  and 
3  treat  a  relatively  simple  yet  very  useful  model  that  is  much  applied  in 
practice  —  namely,  the  linear  regression  model.  The  statistical  properties  of 
the  least  squares  method  are  derived  under  a  number  of  assumptions.  The 
multiple  regression  model  in  Chapter  3  is  formulated  in  matrix  terms, 
because  this  enables  an  analysis  by  means  of  transparent  and  efficient  matrix 
methods  (summarized  in  Appendix  A).  In  Chapter  4  we  extend  Chapters  2 
and  3  to  non-linear  models  and  we  discuss  the  maximum  likelihood  method 
and  the  generalized  method  of  moments.  The  corresponding  estimates  can  be 
computed  by  numerical  optimization  procedures  and  statistical  properties 
can  be  derived  under  the  assumption  that  a  sufficient  number  of  observations 
are  available. 
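
The remark that such estimates are computed by numerical optimization can be illustrated with a deliberately small case. The sketch below is not taken from the book: it assumes an exponential model with density λ·exp(−λx) for positive observations of moderate scale, uses a hypothetical function name, and maximizes the log-likelihood by Newton's method in the parameter θ = log λ. This model is chosen because its maximum likelihood estimate is known in closed form (one over the sample mean), so the numerical answer can be checked.

```python
import math


def exponential_mle(data, tol=1e-10, max_iter=500):
    """Maximum likelihood estimate of the rate of an exponential distribution.

    The log-likelihood in theta = log(rate) is
        l(theta) = n * theta - exp(theta) * sum(data),
    which is concave in theta, so Newton's method is a natural choice.
    Assumes all observations are positive.
    """
    n = len(data)
    total = sum(data)
    theta = 0.0                                # starting value: rate = 1
    for _ in range(max_iter):
        score = n - math.exp(theta) * total    # first derivative of l
        hessian = -math.exp(theta) * total     # second derivative of l
        step = score / hessian
        theta -= step                          # Newton update
        if abs(step) < tol:
            break
    return math.exp(theta)


if __name__ == "__main__":
    sample = [0.8, 2.5, 1.1, 3.0, 0.6]         # made-up data
    rate_numerical = exponential_mle(sample)
    rate_closed_form = len(sample) / sum(sample)
    print("numerical estimate:  ", rate_numerical)
    print("closed-form estimate:", rate_closed_form)
```

For models without a closed-form solution, the same pattern applies, with the score and Hessian of the relevant likelihood (or numerical approximations to them) in place of the two lines above.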

Chapter 5 discusses a set of diagnostic instruments that play a crucial role in
obtaining empirically valid models. Along with tests on the correct specification
of the regression model, we also discuss several extensions that are often
used in practice. This involves, for instance, models with varying parameters,
the use of dummy variables in regression models, robust estimation methods,
instrumental variables methods, models for changing variance (heteroskedasticity),
and dynamic models (serial correlation). Our motivation for the
extensive treatment of these topics is that regression is by far the most popular
method for applied work. The applied researcher should check whether the
required regression assumptions are valid and, if some of the assumptions are
not acceptable, he or she should know how to proceed to improve the model.
Chapter 5 forms the bridge between the basic methods in Chapters 2-4 and
the application areas discussed in Chapters 6 and 7. The sections of Chapter 5
can be read independently of each other, and it is not necessary to study all of
Chapter 5 before proceeding with the applications in Chapters 6 and 7.

In  Chapters  6  and  7  we  discuss  econometric  models  and  methods  for  two 
major  application  areas  —  namely,  discrete  choice  models  and  models  for  time 
series  data.  These  two  chapters  can  be  read  independently  from  each  other. 
Chapter  6  concerns  individual  decision  making  with  applications  in  marketing 
and  microeconomics.  We  discuss  logit  and  probit  models  and  models  for 
truncated  and  censored  data  and  duration  data.  Chapter  7  discusses  univariate 
and  multivariate  time  series  methods,  which  find  many  applications  in  finance 
and  international  economics.  We  pay  special  attention  to  forecasting  methods 
and  to  the  modelling  of  trends  and  changing  variance  in  time  series. 

The book discusses core econometrics and selected key topics. It does not
provide an exhaustive treatment of all econometric topics. For instance, we
discuss only parametric models and we pay hardly any attention to non-parametric
or semi-parametric techniques. Our models are relatively simple
and can be optimized in a relatively straightforward way; for example, we
do not discuss optimization by means of simulation techniques. We pay only
brief attention to panel data models, simultaneous equation models, and
models with latent variables, to mention a few. Also some aspects of significant
practical importance, such as data collection and report writing, are not
discussed in the book. As stated before, our purpose is to give the student a
profound working knowledge of core econometrics needed in good applied
work. We are confident that, with the views and skills acquired after studying
the book, the student will be well prepared to master the other topics on his
or her own.

Study  advice 

In  Chapters  2-4  it  is  assumed  that  the  student  understands  the  statistical 
topics  of  Chapter  1.  The  student  can  check  this  by  means  of  the  keyword  list 
at the end of Chapter 1. The subsections of Chapters 2-4 contain references
to the corresponding relevant parts of Chapter 1, so that statistical topics that
are  unknown  or  partly  forgotten  can  be  studied  along  the  way.  The  further 
reading  list  in  Chapter  1  contains  references  to  statistical  textbooks  that  treat 
the  required  topics  in  much  more  detail.  Chapter  2  is  a  fundamental  chapter 
that  prepares  the  ground  for  all  later  chapters.  It  discusses  the  concept  of  an 
econometric  model  and  the  role  of  random  variables  and  it  treats  statistical 
methods  for  estimation,  testing,  and  forecasting.  This  is  extended  in  Chapters 
3  and  4  to  more  general  models  and  methods.  The  best  way  to  study  is 
as  follows. 

1.  Understand  the  general  nature  of  the  practical  question  of  interest. 

2.  Understand  the  model  formulation  and  the  main  methods  of  analysis, 
including  the  model  properties  and  assumptions. 

3.  Train  the  practical  understanding  by  working  through  the  text  examples 
(preferably  using  a  software  package  to  analyse  the  example  data  sets). 

4.  Obtain  active  understanding  by  doing  the  empirical  exercises,  using 
EViews  or  a  similar  econometric  software  package  (the  data  sets  can  be 
downloaded  from  the  web  site  of  the  book). 

5.  Deepen  the  understanding  by  studying  the  theoretical  parts  in  the  main 
text  and  by  doing  the  theory  and  simulation  exercises.  This  provides  a 
better  understanding  of  the  various  model  assumptions  that  are  needed  to 
justify  the  econometric  analysis. 

After  studying  Chapters  2-4  in  this  way,  the  student  is  ready  for  more. 
Several  options  for  further  chapters  are  open,  and  we  refer  to  the  teaching 
suggestions  and  Exhibit  0.3  below  for  further  details. 

Teaching  suggestions 

The  book  is  suitable  both  for  advanced  undergraduate  courses  and  for 
introductory  graduate  courses  in  business  and  economics  programmes.  In 
applied  courses  much  of  the  underlying  theory  can  easily  be  skipped  without 
loss  of  coherence  of  the  exposition  (by  skipping  the  theory  sections  in  the  text 
and  the  theory  and  simulation  exercises).  In  more  advanced  courses,  the 
theory  parts  in  the  text  clarify  the  structure  of  econometric  models  and  the 
role  of  model  assumptions  needed  to  justify  econometric  methods.  The  book 
can  be  used  in  three  types  of  courses. 

•  Advanced  Undergraduate  Course  on  Econometrics.  Focus  on  Chapters 
2-4,  and  possibly  on  parts  of  Chapter  5.  This  material  can  be  covered  in 
one  trimester  or  semester. 

•  Introductory  Graduate  Course  on  Econometrics.  Focus  on  Chapters  2-4, 
and on some parts of Chapters 5-7. This requires one or two trimesters or
semesters, depending on the background of the students and on the desired
coverage  of  topics. 

•  Intermediate  Graduate  Course  on  Econometrics.  Focus  on  Chapters  5-7, 
and  possibly  on  some  parts  of  Chapters  3-4  as  background  material.  This 
requires  one  or  two  trimesters  or  semesters,  depending  on  background  and 
coverage. 

In  all  cases  it  is  necessary  for  the  students  to  understand  the  statistical  topics 
reviewed  in  Chapter  1.  The  keyword  list  at  the  end  of  Chapter  1  summarizes 
the  required  topics  and  the  further  reading  contains  references  to  textbooks 
that  treat  the  topics  in  more  detail.  Chapters  2-4  treat  concepts  and  methods 
that  are  of  fundamental  importance  in  all  econometric  work.  This  material 
can  be  skipped  only  if  the  students  have  already  followed  an  introductory 
course  in  econometrics.  Chapters  5-7  can  be  treated  selectively,  according  to 
the  purposes  of  the  course.  Exhibit  0.3  gives  an  overview  of  the  dependencies 
between topics. For instance, if the aim is to cover GARCH models (Section
7.4) in the course, then it will be necessary to include the main topics of
Sections 5.4-5.6 and 7.1-7.3.

The  book  is  suitable  for  different  entrance  levels.  Students  starting  in 
econometrics  will  have  to  begin  at  the  top  of  Exhibit  0.3,  and  the  basics 
(Chapters  2-4  and  possibly  selected  parts  of  Chapter  5)  can  be  treated  in  one 
trimester or semester. Students with a preliminary background in econometrics
can start somewhere lower in Exhibit 0.3 and select different routes to
applied  econometric  areas,  which  can  be  treated  thoroughly  in  one  trimester 
or  semester. 

The  book  leaves  the  teacher  a  lot  of  freedom  to  select  topics,  as  long  as  the 
logical  dependencies  between  the  topics  in  Exhibit  0.3  are  respected.  Our 
advice  is  always  to  pay  particular  attention  to  the  motivation  of  models  and 
methods;  the  examples  in  the  main  text  serve  this  purpose,  and  the  students 
can  get  further  training  by  working  on  the  empirical  exercises  at  the  end  of 
each  chapter.  In  our  own  programme  in  Rotterdam,  the  students  work 
together  in  groups  of  four  to  perform  small-scale  projects  on  the  computer 
by  analysing  data  sets  from  the  book.  We  advise  teachers  always  to  include 
the  following  three  ingredients  in  the  course. 

•  Lectures  on  the  book  material  to  discuss  econometric  models  and  methods 
with  illustrative  text  examples,  preferably  supported  by  a  lecture  room  PC 
to  show  the  data  and  selected  results  of  the  analysis. 

•  Computer  sessions  treating  selected  empirical  and  simulation  exercises  to 
get  hands-on  experience  by  applying  econometrics  to  real-world  economic 
and  business  data. 

•  Exercise  sessions  treating  selected  theory  exercises  to  train  mathematical 
and  statistical  econometric  methods  on  paper. 

[Exhibit 0.3 diagram: its applied branches are choice models with applications in microeconomics and marketing, and time series models with applications in macroeconomics and finance.]

Exhibit 0.3 Book structure


Some  possible  course  structures 

For all courses we suggest reserving approximately the following relative
time load for the students' different activities:

•  20  per  cent  for  attending  lectures; 

•  20  per  cent  for  computer  sessions  (10  per  cent  guided,  10  per  cent  group 
work); 

•  20  per  cent  for  exercise  sessions  (10  per  cent  guided,  10  per  cent  individual 
work); 

•  40  per  cent  self-study  of  the  book,  including  preparation  of  computer  and 
paper  exercises. 

For instance, in a twelve-week trimester course with a student load of 120
hours, this corresponds basically to two lecture hours per week and two
exercise hours per week (half on computer and half on paper). Taking this
type of course as our basis, we mention the following possible course
structures for students without previous knowledge of econometrics.

(a) Introductory Econometrics (single course on basics, 120 hours):
Chapters 2-4.

(b) Introductory Econometrics (extended course on basics, 180 hours):
Chapters 2-5.

(c) Econometrics with Applications in Marketing and Microeconomics
(double course, 240 hours): Chapters 2-4, Sections 5.4 and 5.6, and
Chapter 6.

(d) Econometrics with Applications in Finance and Macroeconomics
(double course, 240 hours): Chapters 2-4, Sections 5.4-5.7, and
Chapter 7.

(e) Econometrics with Applications (double course, 240 hours): Chapters
2-4 and Sections 5.4-5.7, 6.1 and 6.2, 7.1-7.6.

(f) Econometrics with Applications (extended double course, 300 hours):
Chapters 2-7.

The book is also suitable for a second course, after an undergraduate introductory
course in econometrics. The book can then be used as a graduate text
by  skipping  most  of  Chapters  2-4  and  choosing  one  of  the  options  (c)-(f) 
above. 

(c2) Econometric Applications in Marketing and Microeconomics (single
course, 120 hours): parts of Chapters 3 and 4, Sections 5.4 and 5.6, and
Chapter 6.

(d2) Econometric Applications in Finance and Macroeconomics (single
course, 120 hours): parts of Chapters 3 and 4, Sections 5.4-5.7, and
Chapter 7.

(e2) Econometric Applications (single course, 120 hours): parts of Chapters
3 and 4 and Sections 5.4-5.7, 6.1 and 6.2, 7.1-7.6.

(f2) Econometric Applications (extended or double course, 180-240 hours):
parts of Chapters 3 and 4, and Chapters 5-7.

In  Rotterdam  we  use  the  book  for  undergraduate  students  in  econometrics 
and  we  basically  follow  option  (e)  above.  This  is  a  second-year  course  for 
students  who  followed  introductory  courses  in  statistics  and  linear  algebra  in 
their  first  year.  We  also  use  the  book  for  first-year  graduate  students  in 
economics  in  Rotterdam  and  Amsterdam.  Here  we  also  basically  follow 
option  (e),  but,  as  the  course  load  is  160  hours,  we  focus  on  practical  aspects 
and  skip  most  of  the  theory  parts. 

Review of Statistics


A  first  step  in  the  econometric  analysis  of  economic  data  is  to  get  an  idea  of 
the  general  pattern  of  the  data.  Graphs  and  sample  statistics  such  as  mean, 
standard  deviation,  and  correlation  are  helpful  tools.  In  general,  economic 
data  are  partly  systematic  and  partly  random.  This  motivates  the  use  of 
random variables and distribution functions to describe the data. This chapter
pays special attention to data obtained by random sampling, where the
observations are mutually independent and come from an underlying population
with fixed mean and standard deviation. The concepts and methods
for  this  relatively  simple  situation  form  the  building  blocks  for  dealing  with 
more  complex  models  that  are  relevant  in  practice  and  that  will  be  discussed 
in  later  chapters. 

1.1 Descriptive statistics


1.1.1 Data graphs

First used in Section 2.1.1.


Data

Economic  data  sets  may  contain  a  large  number  of  observations  for 
many  variables.  For  instance,  financial  investors  can  analyse  the  patterns  of 
many  individual  stocks  traded  on  the  stock  exchange;  marketing  departments 
get  very  detailed  information  on  individual  buyers  from  scanner  data;  and 
national  authorities  have  detailed  data  on  import  and  export  flows  for  many 
kinds  of  goods.  It  is  often  useful  to  summarize  the  information  in  some  way. 
In  this  section  we  discuss  some  simple  graphical  methods  and  in  the  next 
section  some  summary  statistics. 

Example  1.1:  Student  Learning 

As  an  example,  we  consider  in  this  chapter  a  data  set  on  student  learning. 
These  data  were  analysed  by  J.  S.  Butler,  T.  A.  Finegan,  and  J.  J.  Siegfried  in 
their  paper  ‘Does  More  Calculus  Improve  Student  Learning  in  Intermediate 
Micro-  and  Macroeconomic  Theory’  (Journal  of  Applied  Econometrics,  13/2 
(1998), 185-202). This data set contains information on 609 students at
Vanderbilt University in the USA. In total there are thirty-one observed
variables, so that the data set consists of 609 x 31 = 18,879 numbers. In this chapter
we restrict attention to four variables: FGPA (the overall grade
point  average  at  the  end  of  the  freshman  year,  on  a  scale  from  0  to  4),  SATM 
(the  score  on  the  SAT  mathematics  test  divided  by  100,  on  a  scale  from  0  to 
10),  SATV  (the  score  on  the  SAT  verbal  test  divided  by  100,  on  a  scale  from  0 
to  10),  and  FEM  (with  value  1  for  females  and  value  0  for  males).  A  part  of 
the  corresponding  data  table  is  given  in  Exhibit  1.1.  (We  refer  readers  to 
Appendix  B  for  further  details  on  the  data  sets  and  corresponding  notation  of 
variables  used  in  this  book.) 

Graphs 

The data can be visualized by means of various possible graphs. A histogram
of a variable consists of a two-dimensional plot. On the horizontal axis, the
outcome range of the variable is divided into a number of intervals. In the
case of intervals with equal width, the value on the vertical axis measures the
number of observations of the variable that have an outcome in that particular
interval. The sample cumulative distribution function (SCDF) is represented
by a two-dimensional plot with the outcome range of the variable on
the horizontal axis. For each value v in this range, the function value on the
vertical axis is the fraction of the observations with an outcome smaller than
or equal to v. To investigate possible dependencies between two variables one
can draw a scatter diagram. One variable is measured along the horizontal
axis, the other along the vertical axis, and the plot consists of points representing
the joint outcomes of the two variables that occur in the data set.

Obs.    FGPA     SATM    SATV    FEM
1       3.125    6.6     5.5     0
2       1.500    6.7     7.0     0
3       2.430    6.6     6.0     0
4       3.293    6.1     5.4     1
5       2.456    6.5     5.2     0
6       2.806    6.5     5.4     1
7       2.455    6.2     4.8     0
8       3.168    6.2     4.6     0
9       2.145    4.3     4.7     0
10      2.700    6.1     5.6     0
11      3.296    5.8     5.1     1
12      2.240    6.4     5.5     1
...     ...      ...     ...     ...
608     2.996    6.6     6.5     1
609     2.133    6.9     6.2     0

Exhibit 1.1 Student Learning (Example 1.1)

Part of data on 609 students on FGPA (grade point average at the end of the freshman year),
SATM (scaled score on SAT mathematics test), SATV (scaled score on SAT verbal test), and
FEM (1 for females, 0 for males).
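
As a small illustration of the histogram and the SCDF, the following sketch applies both definitions to the 14 FGPA values that are printed in Exhibit 1.1. It is not the full data set of 609 students, and the choice of four intervals of width one, as well as the function names `histogram_counts` and `scdf`, are assumptions made only for this example.

```python
# The 14 FGPA values printed in Exhibit 1.1 (observations 1-12, 608, and 609).
fgpa = [3.125, 1.500, 2.430, 3.293, 2.456, 2.806, 2.455,
        3.168, 2.145, 2.700, 3.296, 2.240, 2.996, 2.133]


def histogram_counts(data, lower, upper, n_bins):
    """Count the observations in n_bins equal-width intervals on [lower, upper]."""
    width = (upper - lower) / n_bins
    counts = [0] * n_bins
    for value in data:
        if not lower <= value <= upper:
            raise ValueError("observation outside the outcome range")
        # The right end point of the range is assigned to the last interval.
        index = min(int((value - lower) / width), n_bins - 1)
        counts[index] += 1
    return counts


def scdf(data, v):
    """Sample CDF: fraction of observations smaller than or equal to v."""
    return sum(1 for value in data if value <= v) / len(data)


# Four intervals of width 1 on the FGPA scale from 0 to 4 (an assumed bin choice).
counts = histogram_counts(fgpa, 0.0, 4.0, 4)
print("histogram counts:", counts)
print("SCDF at 2.5:", scdf(fgpa, 2.5))
```

The counts give the heights of the histogram bars, and evaluating `scdf` on a grid of values v gives the step function that is plotted as the SCDF.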

Example 1.2: Student Learning (continued)

Exhibit 1.2 shows histograms (a, c, e) and SCDFs (b, d, f) of the variables
FGPA,  SATM,  and  SATV,  and  scatter  diagrams  of  FGPA  against  SATM  (g), 
FGPA  against  SATV  (h),  and  SATM  against  SATV  (i).  The  scatter  diagrams 
show  much  variation  in  the  outcomes.  In  this  example  it  is  not  so  easy  to 
determine  from  the  diagrams  whether  the  variables  are  related  or  not. 

Exhibit 1.2 Student Learning (Example 1.2)

Histograms and sample cumulative distribution functions of FGPA ((a)-(b)), SATM ((c)-(d)),
and SATV ((e)-(f)), and scatter diagrams ((g)-(i)) of FGPA against SATM (g), of FGPA against
SATV (h), and of SATM against SATV (i).

More than two variables

For three variables it is possible to plot a three-dimensional scatter cloud, but
such graphs are often difficult to read. Instead three two-dimensional scatter
diagrams can be used. The same idea applies for four or more variables. It
should be realized, however, that histograms and scatter diagrams provide
only partial information if there are more than one or two variables. The
shape of these diagrams will partly be determined by the neglected variables,
but the influence of these variables cannot be detected from the diagrams. In
Example 1.3 we give an illustration of the possible effects of such a partial
analysis. One of the main purposes of econometric modelling is to disentangle
the mutual dependencies between a group of variables.

Example 1.3: Student Learning (continued)

The  histogram  of  FGPA  shows  a  spread  that  is  partly  caused  by  differences  in 
the  learning  abilities  of  the  students.  If  they  had  differed  less  on  their  SATM 
and  SATV  scores,  then  they  would  possibly  have  had  less  different  FGPA 
outcomes.  As  an  example,  Exhibit  1.3  shows  histograms  for  two  groups  of 
students.  The  609  students  are  ordered  by  their  average  SAT  score,  defined  as 
SATA  =  0.5(SATM  +  SATV).  The  first  group  consists  of  students  with  low  or 
high  SATA  scores  (rank  numbers  between  1  and  100  and  between  510  and 
609)  and  the  second  group  with  middle  SATA  scores  (rank  numbers  between 
205  and  405).  As  expected,  the  spread  of  the  FGPA  scores  in  the  first  group 
(see Exhibit 1.3(a)) is somewhat larger than that in the second group (see
Exhibit  1.3(b)).  The  difference  is  small,  though,  and  cannot  easily  be  detected 
from  Exhibit  1.3.  In  the  next  section  we  describe  numerical  measures  for  the 
spread  of  data  that  will  simplify  the  comparison. 

In  general,  the  variation  in  one  variable  may  be  partly  caused  by  another 
variable,  which  of  course  cannot  be  detected  from  a  histogram. 


Exhibit 1.3 Student Learning (Example 1.3)

Histograms for FGPA scores of students with 100 lowest and 100 highest average SATA scores
(a) and for FGPA scores of 201 students with middle average SATA scores (b).

Exercises: E: 1.11c, 1.13a, d.

1.1.2 Sample statistics

First used in Section 2.1.1; uses Appendix A.1.


Sample  moments 

For a single variable, the shape of the histogram is often summarized by
measures of location and dispersion. Let the number of observations be
denoted by $n$ and let the observed data points be denoted by $y_i$ with
$i = 1, 2, \ldots, n$. The sample mean is defined as the average of the observations
over the sample, that is,

$$\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i. \tag{1.1}$$

The sample mean is also called the first sample moment. An alternative
measure of location is the median. Let the observations be ordered so that
$y_i \le y_{i+1}$ for $i = 1, \ldots, n-1$; then the median is equal to the middle
observation $y_{(n+1)/2}$ if $n$ is odd and equal to $\frac{1}{2}(y_{n/2} + y_{n/2+1})$ if $n$ is even.

A measure of dispersion is the second sample moment, defined by

$$m_2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2. \tag{1.2}$$

For reasons that will become clear later (see Example 1.9), in practice one
often uses a slightly different measure of dispersion defined by

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2. \tag{1.3}$$

This is called the sample variance, and the sample standard deviation is equal
to $s$ (the square root of $s^2$). The $r$th (centred) sample moment is defined by
$m_r = \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^r$ and the standardized $r$th moment is defined by $m_r/s^r$. In
particular, $m_3/s^3$ is called the skewness and $m_4/s^4$ the kurtosis. The skewness
is zero if the observations are distributed symmetrically around the mean,
negative if the left tail is longer than the right tail, and positive if the right tail
is longer than the left tail. If the mean is larger (smaller) than the median, this
is an indication of positive (negative) skewness. The kurtosis measures the
relative amount of observations in the tails as compared to the amount of
observations around the mean. The kurtosis is larger for distributions with
fatter tails.
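
The definitions of this section can be written out directly in code. The sketch below is only an illustration: it uses the 14 FGPA values printed in Exhibit 1.1 instead of all 609 observations, so its results will differ from those reported for the full data set, and the function names are our own choice.

```python
# The 14 FGPA values printed in Exhibit 1.1 (not the full sample of 609 students).
fgpa = [3.125, 1.500, 2.430, 3.293, 2.456, 2.806, 2.455,
        3.168, 2.145, 2.700, 3.296, 2.240, 2.996, 2.133]


def sample_mean(y):
    """Sample mean, the average of the observations as in (1.1)."""
    return sum(y) / len(y)


def sample_median(y):
    """Middle observation for odd n, average of the two middle ones for even n."""
    ordered = sorted(y)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return 0.5 * (ordered[middle - 1] + ordered[middle])


def centred_moment(y, r):
    """r-th centred sample moment m_r (divisor n), as in (1.2) for r = 2."""
    mean = sample_mean(y)
    return sum((value - mean) ** r for value in y) / len(y)


def sample_variance(y):
    """Sample variance s^2 (divisor n - 1), as in (1.3)."""
    mean = sample_mean(y)
    return sum((value - mean) ** 2 for value in y) / (len(y) - 1)


def skewness(y):
    """Standardized third moment m_3 / s^3."""
    return centred_moment(y, 3) / sample_variance(y) ** 1.5


def kurtosis(y):
    """Standardized fourth moment m_4 / s^4."""
    return centred_moment(y, 4) / sample_variance(y) ** 2


print("mean:", sample_mean(fgpa), "median:", sample_median(fgpa))
print("s:", sample_variance(fgpa) ** 0.5)
print("skewness:", skewness(fgpa), "kurtosis:", kurtosis(fgpa))
```

Note that the standardization follows the text and divides by powers of $s$ (with divisor $n - 1$); some software packages standardize with $m_2$ instead, which gives slightly different values in small samples.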

Example 1.4: Student Learning (continued)

Exhibit 1.4 shows the sample mean, median, standard deviation, skewness,
and kurtosis of the data on FGPA (a), SATM (b), and SATV (c). Both the
mean and the median of the SATV scores are lower than those of the SATM
scores. The tails of the SATM scores are somewhat fatter on the left, and the
mean is smaller than the median. The tails of FGPA and SATV are somewhat
fatter on the right, and the mean exceeds the median. Of the three variables,
FGPA has the smallest kurtosis, as it contains somewhat fewer observations in
the tails as compared to SATM and SATV. Further, returning to our discussion
in Example 1.3 on two groups of students, we measure the spread of the
FGPA  scores  in  both  groups  by  the  sample  standard  deviation.  The  first 
group  of  students  (with  either  low  or  high  average  SATA  scores)  has 
s  =  0.485,  whereas  the  second  group  of  students  (with  middle  average 
SATA  scores)  has  s  =  0.449.  As  expected,  the  standard  deviation  is  larger 
for  the  first,  more  heterogeneous  group  of  students,  but  the  difference  is 
small. 


Exhibit 1.4 Student Learning (Example 1.4)

Summary statistics of FGPA (a), SATM (b), and SATV (c) of 609 students.

Covariance and correlation

The dependence between two variables can be measured by their common
variation. Let the two variables be denoted by $x$ and $y$, with observed
outcome pairs $(x_i, y_i)$ for $i = 1, \ldots, n$. Let $\bar{x}$ be the sample mean of $x$ and $\bar{y}$
that of $y$, and let $s_x$ be the standard deviation of $x$ and $s_y$ that of $y$. Then the
sample covariance between $x$ and $y$ is defined by

$$s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \tag{1.4}$$

and the sample correlation coefficient by

$$r_{xy} = \frac{s_{xy}}{s_x s_y}. \tag{1.5}$$

When two variables are positively correlated, this means that, on average,
relatively large observations on $x$ correspond with relatively large observations
on $y$ and small observations on $x$ with small observations on $y$. The
correlation coefficient $r_{xy}$ always lies between $-1$ and $+1$ and it does not
depend on the units of measurement (see Exercise 1.1).
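
The following sketch computes the sample covariance and correlation and checks numerically that the correlation does not depend on the units of measurement. It uses the 14 SATM and SATV pairs printed in Exhibit 1.1, so the outcome is only illustrative and differs from the correlation for all 609 students; the rescaling of SATM by a factor 100 (back to the original SAT scale) is our own choice of example.

```python
import math

# The 14 (SATM, SATV) pairs printed in Exhibit 1.1 (not the full sample).
satm = [6.6, 6.7, 6.6, 6.1, 6.5, 6.5, 6.2, 6.2, 4.3, 6.1, 5.8, 6.4, 6.6, 6.9]
satv = [5.5, 7.0, 6.0, 5.4, 5.2, 5.4, 4.8, 4.6, 4.7, 5.6, 5.1, 5.5, 6.5, 6.2]


def sample_covariance(x, y):
    """Sample covariance with divisor n - 1."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def sample_correlation(x, y):
    """Sample correlation: covariance divided by both standard deviations."""
    s_x = math.sqrt(sample_covariance(x, x))
    s_y = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (s_x * s_y)


r = sample_correlation(satm, satv)

# Measure SATM in original SAT points instead of hundreds of points.
satm_points = [100 * value for value in satm]
r_rescaled = sample_correlation(satm_points, satv)

print("covariance:", sample_covariance(satm, satv))
print("covariance after rescaling:", sample_covariance(satm_points, satv))
print("correlation:", r, "after rescaling:", r_rescaled)
```

The covariance is multiplied by 100 when SATM is rescaled, whereas the correlation is unchanged (up to rounding error).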

In the case of two or more variables, the first and second moments can be
summarized in vectors and matrices (see Appendix A for an overview of
results on matrices that are used in this book). When there are $p$ variables, the
corresponding sample means can be collected in a $p \times 1$ vector, and when $s_{jk}$
denotes the sample covariance between the $j$th and $k$th variable, then the
$p \times p$ sample covariance matrix $S$ is defined by

$$S = \begin{pmatrix} s_{11} & s_{12} & \cdots & s_{1p} \\ s_{21} & s_{22} & \cdots & s_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ s_{p1} & s_{p2} & \cdots & s_{pp} \end{pmatrix}.$$

The diagonal elements are the sample variances of the variables. The sample
correlation coefficients are given by $r_{jk} = s_{jk}/\sqrt{s_{jj} s_{kk}}$, and the $p \times p$ correlation
matrix is defined similarly to the covariance matrix by replacing the elements $s_{jk}$
by $r_{jk}$. As $r_{jj} = 1$, this matrix contains unit elements on the diagonal.

Example 1.5: Student Learning (continued)

Exhibit  1.5  shows  the  sample  covariance  matrix  (Panel  1)  and  the  sample 
correlation  matrix  (Panel  2)  for  the  four  variables  FGPA,  SATM,  SATV,  and 
FEM.  The  covariances  are  scale  dependent.  The  correlations  do  not  depend 
on  the  scale  of  measurement  and  are  therefore  easier  to  interpret.  The  scores 
on  FGPA,  SATM,  and  SATV  are  all  positively  correlated.  As  compared  with 
males,  females  have  on  average  somewhat  better  scores  on  FGPA  and  SATV 
and  somewhat  lower  scores  on  SATM. 

Panel 1    FGPA      SATM      SATV      FEM
FGPA       0.211     0.053     0.028     0.040
SATM       0.053     0.354     0.115    -0.047
SATV       0.028     0.115     0.451     0.011
FEM        0.040    -0.047     0.011     0.237

Panel 2    FGPA      SATM      SATV      FEM
FGPA       1.000     0.195     0.092     0.176
SATM       0.195     1.000     0.288    -0.163
SATV       0.092     0.288     1.000     0.034
FEM        0.176    -0.163     0.034     1.000

Exhibit 1.5 Student Learning (Example 1.5)

Sample covariances (Panel 1) and sample correlations (Panel 2) of FGPA, SATM, SATV, and
FEM for 609 students.
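
The relation between the two panels of Exhibit 1.5 can be checked numerically: each correlation equals the covariance divided by the square root of the product of the two corresponding variances. The sketch below does this with the rounded numbers of the exhibit, so small differences in the third decimal are to be expected; the function name `correlation_matrix` is our own.

```python
import math

names = ["FGPA", "SATM", "SATV", "FEM"]

# Panel 1 of Exhibit 1.5: sample covariance matrix (rounded to three decimals).
cov = [[0.211, 0.053, 0.028, 0.040],
       [0.053, 0.354, 0.115, -0.047],
       [0.028, 0.115, 0.451, 0.011],
       [0.040, -0.047, 0.011, 0.237]]

# Panel 2 of Exhibit 1.5: sample correlation matrix as reported.
corr_reported = [[1.000, 0.195, 0.092, 0.176],
                 [0.195, 1.000, 0.288, -0.163],
                 [0.092, 0.288, 1.000, 0.034],
                 [0.176, -0.163, 0.034, 1.000]]


def correlation_matrix(s):
    """Replace each covariance s_jk by r_jk = s_jk / sqrt(s_jj * s_kk)."""
    p = len(s)
    return [[s[j][k] / math.sqrt(s[j][j] * s[k][k]) for k in range(p)]
            for j in range(p)]


corr = correlation_matrix(cov)

# Largest difference between the recomputed and the reported correlations.
max_gap = max(abs(corr[j][k] - corr_reported[j][k])
              for j in range(len(names)) for k in range(len(names)))

for name, row in zip(names, corr):
    print(name, [round(value, 3) for value in row])
print("largest gap with Panel 2:", round(max_gap, 4))
```

The remaining gaps arise because the covariances in Panel 1 are themselves rounded before the correlations are recomputed.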


Exercises:  T:  1.1;  E:  1.11a,  b,  d,  1.13b. 

1.2 Random variables


1.2.1  Single  random  variables 

First  used  in  Section  2.2.3. 


Randomness 

The  observed  outcomes  of  variables  are  often  partly  systematic  and  partly 
random.  One  of  the  causes  of  randomness  is  sampling.  For  instance,  the  data 
on  student  scores  in  Example  1.1  concern  a  group  of  609  students.  Other 
data  would  have  been  obtained  if  another  group  of  students  (at  another 
university  or  in  another  year)  had  been  observed. 

Distributions 

A variable $y$ is called random if, prior to observation, its outcome cannot be
predicted with certainty. The uncertainty about the outcome is described by a
probability distribution. If the set of possible outcome values is discrete, say
$V = \{v_1, v_2, \ldots\}$, then the distribution is given by the set of probabilities
$p_i = P[y = v_i]$, the probability of the outcome $v_i$. These probabilities have the
properties that $p_i \ge 0$ and $\sum_i p_i = 1$. The corresponding cumulative distribution
function (CDF) is given by $F(v) = P[y \le v] = \sum_{\{i;\, v_i \le v\}} p_i$, which is a
non-decreasing function with $\lim_{v \to -\infty} F(v) = 0$ and $\lim_{v \to \infty} F(v) = 1$. If the set of
possible outcomes is continuous, then the CDF is again defined by $P[y \le v]$,
and, if this function is differentiable, then the derivative $f(v) = \frac{dF(v)}{dv}$ is called
the probability density function. It has the properties that $f(v) \ge 0$ and
$\int_{-\infty}^{\infty} f(v)\,dv = 1$. Interval probabilities are obtained from
$P[a < y \le b] = F(b) - F(a) = \int_a^b f(v)\,dv$.

The CDF of a random variable is also called the population CDF, as it
represents the distribution of all the possible outcomes of the variable. For
observed data $y_1, \ldots, y_n$, the sample cumulative distribution function
(SCDF) of Section 1.1.1 is given by $F_s(v) = \frac{1}{n}(\text{number of } y_i \le v)$.
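
As a numerical check of the relation between the CDF and the density, the sketch below takes the density $f(v) = e^{-v}$ for $v \ge 0$ with CDF $F(v) = 1 - e^{-v}$. This distribution is chosen here only for illustration, and the helper `integrate` is a simple midpoint rule of our own, not a library routine.

```python
import math


def density(v):
    """Density f(v) = exp(-v) for v >= 0 (an assumed example distribution)."""
    return math.exp(-v) if v >= 0 else 0.0


def cdf(v):
    """Corresponding CDF F(v) = 1 - exp(-v) for v >= 0."""
    return 1.0 - math.exp(-v) if v >= 0 else 0.0


def integrate(f, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    width = (b - a) / steps
    return width * sum(f(a + (k + 0.5) * width) for k in range(steps))


a, b = 0.5, 2.0
prob_from_cdf = cdf(b) - cdf(a)               # F(b) - F(a)
prob_from_density = integrate(density, a, b)  # integral of f from a to b

# The density should integrate to one; the range is cut off at 40,
# where the remaining probability mass is negligible.
total_mass = integrate(density, 0.0, 40.0, steps=200_000)

print(prob_from_cdf, prob_from_density, total_mass)
```

Both routes give the same interval probability up to the numerical error of the integration.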

Remarks  on  notation 

Some remarks on notation are in order. In statistics one usually denotes
random variables by capital letters (for instance, $Y$) and observed outcomes
of random variables by lower-case letters (for instance, $y$). However, in
econometrics it is usual to reserve capital letters for matrices only, so that
the notation in econometrics differs from the usual one in statistics. To avoid
confusion with the notation in later chapters, we use lower-case letters (like
$y$) to denote random variables. Further, for the random variable $y$ we denoted
the set of possible outcome values by $V$ and the observed outcome by $v$.
However, in a sample of $n$ observed data, the observations are usually
denoted by $y_i$ with $i = 1, \ldots, n$. Prior to observation, the outcome of $y_i$ can
be seen as a random variable. After observation, the realized values could be
denoted by say $v(y_i)$, the outcome value of the random variable $y_i$, but for
simplicity of notation we write $y_i$ both for the random variables and for the
observed outcomes. This notation is common in econometrics. We will make
sure that it is always clear from the context what the notation $y_i$ means, a
random variable (prior to observation) or an observed outcome.


Mean 

The distribution of a random variable can be summarized by measures
of location and dispersion. If $y$ has a discrete distribution, then the (population)
mean is defined as a weighted average over the outcome set $V$ with
weights equal to the probabilities $p_i$ of the different outcomes $v_i$, that is,

$$\mu = E[y] = \sum_i v_i p_i. \tag{1.6}$$

The operator $E$ that determines the mean of a random variable is also called
the expectation operator. Note that the sample mean is obtained when the
SCDF is used. When $y$ has a continuous distribution with density function $f$,
the mean is defined by

$$\mu = E[y] = \int v f(v)\,dv \tag{1.7}$$

(if an integral runs from $-\infty$ to $+\infty$, we delete this for simplicity of
notation).


Variance 

The (population) variance is defined as the mean of $(y - \mu)^2$. For a discrete
distribution this gives

$$\sigma^2 = E[(y - \mu)^2] = \sum_i (v_i - \mu)^2 p_i \tag{1.8}$$

and for a continuous distribution

$$\sigma^2 = E[(y - \mu)^2] = \int (v - \mu)^2 f(v)\,dv. \tag{1.9}$$

The standard deviation $\sigma$ is the square root of the variance $\sigma^2$. The mean is
also called the (population) first moment, and the variance the second
(centred) moment.
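
For a discrete distribution, the mean and variance are weighted sums that are easily computed. The sketch below uses a hypothetical distribution (the number of heads in two tosses of a fair coin) and exact fractions; it also illustrates the remark that the sample mean results when the SCDF is used, that is, when each of the $n$ observations gets weight $1/n$. The small data list is made up for this purpose.

```python
from fractions import Fraction

# Hypothetical discrete distribution: number of heads in two fair coin tosses.
outcomes = [0, 1, 2]
probs = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)]


def mean(values, weights):
    """Weighted average of the outcomes, with the probabilities as weights."""
    return sum(w * v for v, w in zip(values, weights))


def variance(values, weights):
    """Mean of the squared deviations from the mean."""
    mu = mean(values, weights)
    return sum(w * (v - mu) ** 2 for v, w in zip(values, weights))


mu = mean(outcomes, probs)
sigma2 = variance(outcomes, probs)

# Using the SCDF means giving each observation the weight 1/n.
data = [1, 2, 2, 5]  # made-up observations
weights = [Fraction(1, len(data))] * len(data)
sample_mean = mean(data, weights)
second_sample_moment = variance(data, weights)

print("population mean and variance:", mu, sigma2)
print("sample mean and second sample moment:", sample_mean, second_sample_moment)
```

With the weights $1/n$ the variance formula gives the second sample moment with divisor $n$, not the sample variance $s^2$ with divisor $n - 1$.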

Higher  moments 

The $r$th centred moment is defined as the mean of $(y - \mu)^r$, that is (in the
case of a continuous distribution), $\mu_r = E[(y - \mu)^r] = \int (v - \mu)^r f(v)\,dv$. The
standardized $r$th moment is given by $\mu_r/\sigma^r$. For $r = 3$ this gives the skewness
and for $r = 4$ the kurtosis. The sample moments of Section 1.1.2 are obtained
by replacing the CDF by the sample CDF. Although the sample moments
always exist, this is not always the case for the population moments. If
$E[|y - \mu|^c] < \infty$, then all the moments $\mu_r$ with $r \le c$ exist. In particular, a
random variable with a finite variance also has a finite mean.

Transformations  of  random  variables 

Now we consider the statistical properties of functions of random variables.
If $y$ is a random variable and $g$ is a given function, then $z = g(y)$ is also a
random variable. Suppose that $g$ is invertible with inverse function $y = h(z)$.
If $y$ has a discrete distribution with outcomes $\{v_1, v_2, \ldots\}$, then $z$ also has a
discrete distribution with outcomes $\{w_i = g(v_i),\ i = 1, 2, \ldots\}$ and probabilities
$P[z = w_i] = P[y = h(w_i)] = p_i$. When $y$ has a continuous distribution
with density function $f_y$ and $h$ is differentiable with derivative $h'$, then $z$ has
density function

$$f_z(w) = f_y(h(w))\,|h'(w)| \tag{1.10}$$

(see Exercise 1.3 for a special case). The mean of $z$ is given by
$E[z] = E[g(y)] = \sum_i p_i g(v_i)$ in the discrete case and by $E[g(y)] = \int f(v) g(v)\,dv$ in the
continuous case. If $g$ is linear, so that $g(y) = ay + b$ for some constants $a$ and
$b$, then $E[ay + b] = aE[y] + b$, but if $g$ is not linear then $E[g(y)] \ne g(E[y])$
in general.
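
The two statements on means and the density formula (1.10) can be illustrated as follows. All distributions in this sketch are assumptions chosen for the example: a discrete $y$ that takes the values 1, 2, 3 with equal probability, and a continuous $y$ with density $e^{-v}$ for $v \ge 0$ that is transformed by $z = e^y$, so that $h(w) = \log w$ and $h'(w) = 1/w$.

```python
import math
from fractions import Fraction

# Discrete case: y takes the values 1, 2, 3 with equal probability (assumed).
outcomes = [1, 2, 3]
probs = [Fraction(1, 3)] * 3


def expectation(g, values, weights):
    """E[g(y)] for a discrete distribution: sum of p_i * g(v_i)."""
    return sum(w * g(v) for v, w in zip(values, weights))


mean_y = expectation(lambda v: v, outcomes, probs)               # E[y]
mean_linear = expectation(lambda v: 2 * v + 3, outcomes, probs)  # E[2y + 3]
mean_square = expectation(lambda v: v * v, outcomes, probs)      # E[y^2]


# Continuous case: y has density exp(-v) on v >= 0 and z = exp(y) >= 1.
def f_y(v):
    return math.exp(-v) if v >= 0 else 0.0


def f_z(w):
    """Density of z from (1.10) with h(w) = log(w) and h'(w) = 1 / w."""
    if w < 1:
        return 0.0
    return f_y(math.log(w)) * abs(1.0 / w)


def integrate(f, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    width = (b - a) / steps
    return width * sum(f(a + (k + 0.5) * width) for k in range(steps))


c = 4.0
prob_z = integrate(f_z, 1.0, c)            # P[z <= c] from the density of z
prob_y = 1.0 - math.exp(-math.log(c))      # P[y <= log(c)] from the CDF of y

print("E[y]:", mean_y, "E[2y+3]:", mean_linear, "E[y^2]:", mean_square)
print("P[z <= 4] via (1.10):", prob_z, "via y:", prob_y)
```

In the discrete example $E[2y + 3] = 2E[y] + 3$, whereas $E[y^2]$ differs from $(E[y])^2$; in the continuous example both routes give the same probability for the event $z \le 4$.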


Exercises:  T:  1.3a. 

1.2.2 Joint random variables

First  used  in  Section  3.1.4;  uses  Appendix  A.2-A.4. 


Two  random  variables 

When there are two or more variables of interest, one can consider their joint
distribution. For instance, the data set on 609 student scores in Example 1.1
contains the outcomes of mathematics and verbal tests. The uncertainty
about the pair of outcomes $(x, y)$ on these two tests can be described by a
joint probability distribution. If the sets of possible outcome values for $x$ and
$y$ are both discrete, say $V = \{v_1, v_2, \ldots\}$ and $W = \{w_1, w_2, \ldots\}$, then the
joint distribution is given by the set of probabilities $p_{ij} = P[x = v_i, y = w_j]$.
The corresponding cumulative distribution function (CDF) is given by
$F(v, w) = P[x \le v, y \le w] = \sum_{\{(i,j);\, v_i \le v,\, w_j \le w\}} p_{ij}$. If the sets of possible
outcomes are continuous, then the CDF is also defined as
$F(v, w) = P[x \le v, y \le w]$, and if the second derivative of this function
exists, then the corresponding density function is defined by
$f(v, w) = \frac{\partial^2 F(v, w)}{\partial v\, \partial w}$. The density function has the properties $f(v, w) \ge 0$ and
$\int \int f(v, w)\,dv\,dw = 1$, and every function with these two properties describes
a joint probability distribution.

When the joint distribution of $x$ and $y$ is given, the individual distributions
of $x$ and $y$ (also called the marginal distributions) can be derived. The CDF $F_y$
of $y$ is obtained from the CDF of $(x, y)$ by $F_y(w) = P[y \le w] = F(\infty, w)$. For
continuous distributions, the corresponding densities are related by
$f_y(w) = \int f(v, w)\,dv$. Mean and variance of $x$ and $y$ can also be determined
in this way, for instance, $\mu_y = \int f_y(w) w\,dw = \int \int f(v, w) w\,dv\,dw$.

Covariance  and  correlation 

The  covariance  between  x  and  y  is  defined  (for  continuous  distributions) 
by 


$$\mathrm{cov}(x, y) = E[(x - \mu_x)(y - \mu_y)] = \int\!\int (v - \mu_x)(w - \mu_y) f(v, w)\,dv\,dw.$$

The correlation coefficient between x and y is defined by

$$\rho_{xy} = \frac{\mathrm{cov}(x, y)}{\sigma_x \sigma_y} \qquad (1.11)$$

where $\sigma_x$ and $\sigma_y$ are the standard deviations of x and y. The two random variables are called uncorrelated if $\rho_{xy} = 0$. This is equivalent to the condition that $E[xy] = E[x]E[y]$.


Conditional  distribution 

The  conditional  distribution  of  y  for  given  value  of  x  is  defined  as  follows. 
When the distribution is discrete and the outcome $x = v_i$ is given, with $P[x = v_i] > 0$, the conditional probabilities are given by

$$P[y = w_j \mid x = v_i] = \frac{P[x = v_i, y = w_j]}{P[x = v_i]} = \frac{p_{ij}}{\sum_j p_{ij}}. \qquad (1.12)$$

This gives a new distribution for y, as the conditional probabilities sum up (over j) to unity. For continuous distributions, the conditional density $f_{y|x=v}$ is defined as follows (for values of v for which $f_x(v) > 0$).

$$f_{y|x=v}(w) = \frac{f(v, w)}{f_x(v)} = \frac{f(v, w)}{\int f(v, w)\,dw}. \qquad (1.13)$$


Conditional  mean  and  variance 

The conditional mean and variance of y for given value $x = v$ are the mean and variance with respect to the corresponding conditional distribution. For instance, for continuous distributions the conditional expectation is given by

$$E[y \mid x = v] = \int f_{y|x=v}(w)\,w\,dw = \frac{\int f(v, w)\,w\,dw}{\int f(v, w)\,dw}. \qquad (1.14)$$

Note that the conditional expectation is a function of v, so that $E[y|x]$ is a random variable with density $f_x(v)$. The mean of this conditional expectation is (see Exercise 1.2)

$$E[E[y \mid x]] = \int E[y \mid x = v]\,f_x(v)\,dv = E[y]. \qquad (1.15)$$


In words, the conditional expectation $E[y|x]$ (a function of the random variable x) has the same mean as the unconditional random variable y. The conditional variance $\mathrm{var}(y \mid x = v)$ is the variance of y with respect to the conditional distribution $f_{y|x=v}$. This variance depends on the value of v, and the mean of this variance satisfies (see Exercise 1.2)

$$E[\mathrm{var}(y \mid x)] = \int f_x(v) \left( \int f_{y|x=v}(w)\,(w - E[y \mid x = v])^2\,dw \right) dv \le \mathrm{var}(y). \qquad (1.16)$$

So, on average, the conditional random variable $y \mid x = v$ has a smaller
variance than the unconditional random variable y. That is, knowledge of
the  outcome  of  the  variable  x  helps  to  reduce  the  uncertainty  about  the 
outcome  of  y.  This  is  an  important  motivation  for  econometric  models 
with  explanatory  variables.  In  such  models,  the  differences  in  the  outcomes 
of  the  variable  of  interest  (y)  are  explained  in  terms  of  underlying  factors  (x) 
that  influence  this  variable.  For  instance,  the  variation  in  the  FGPA  scores  of 
students  can  be  related  to  differences  in  student  abilities  as  measured  by  their 
SATM  and  SATV  scores.  Such  econometric  models  with  explanatory  vari¬ 
ables  are  further  discussed  in  Chapters  2  and  3. 
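The relations (1.15) and (1.16) can be checked by hand on a small discrete example. The sketch below uses a made-up joint probability table (the outcome values and probabilities are assumptions chosen only for illustration, not data from the book) and applies the discrete analogue of (1.12) to obtain the conditional moments.

```python
# Hypothetical joint probabilities p[i][j] = P[x = v_i, y = w_j] (made-up numbers).
v = [0, 1]            # possible outcomes of x
w = [1, 2, 3]         # possible outcomes of y
p = [[0.10, 0.20, 0.10],
     [0.30, 0.10, 0.20]]

# Marginal distributions: sum the joint probabilities over the other variable.
px = [sum(row) for row in p]
py = [sum(p[i][j] for i in range(len(v))) for j in range(len(w))]

# Unconditional mean and variance of y.
mean_y = sum(py[j] * w[j] for j in range(len(w)))
var_y = sum(py[j] * (w[j] - mean_y) ** 2 for j in range(len(w)))

# Conditional mean and variance of y given x = v_i, using (1.12).
cond_mean, cond_var = [], []
for i in range(len(v)):
    cond_p = [p[i][j] / px[i] for j in range(len(w))]
    m = sum(cond_p[j] * w[j] for j in range(len(w)))
    cond_mean.append(m)
    cond_var.append(sum(cond_p[j] * (w[j] - m) ** 2 for j in range(len(w))))

# Average the conditional moments over the distribution of x.
mean_of_cond_mean = sum(px[i] * cond_mean[i] for i in range(len(v)))
mean_of_cond_var = sum(px[i] * cond_var[i] for i in range(len(v)))

print(mean_y, mean_of_cond_mean)   # the two means agree, as in (1.15)
print(mean_of_cond_var, var_y)     # the first does not exceed the second, as in (1.16)
```

In this table x and y are dependent, so conditioning on x reduces the average variance of y, although only slightly.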

Independence 

A  special  situation  occurs  when  the  conditional  distribution  is  always  equal 
to the marginal distribution. For discrete distributions this is the case if and only if $P[y = w_j \mid x = v_i] = P[y = w_j]$ for all $(v_i, w_j)$, that is,

$$P[x = v_i, y = w_j] = P[x = v_i]\,P[y = w_j]$$

for all $(v_i, w_j)$. For continuous distributions the condition is that

$$f(v, w) = f_x(v)\,f_y(w)$$

for all (v, w). If this holds true, then x and y are called independent random variables. So in this case the joint distribution is simply obtained by multiplying the marginal distributions with each other. It follows from (1.12) and (1.13) that for independent variables $E[y \mid x = v] = E[y]$ is independent of the value v of x. Further, for independent variables there holds $\mathrm{var}(y \mid x = v) = \mathrm{var}(y)$ for all values $x = v$, and hence also $E[\mathrm{var}(y \mid x)] = \mathrm{var}(y)$. If x and y
are  independent,  then  the  uncertainty  of  y  is  not  diminished  by  conditioning 
on  x,  that  is,  the  variable  x  does  not  contain  information  on  the  variable  y. 
Independent  variables  are  always  uncorrelated,  but  the  reverse  does  not  hold 
true  (see  Exercise  1.2). 

More  than  two  random  variables 

The  definitions  of  joint,  marginal,  and  conditional  distributions  are  easily 
extended  to  the  case  of  more  than  two  random  variables.  For  instance,  the 
joint density function of p continuous random variables $y_1, \ldots, y_p$ is a function $f(v_1, \ldots, v_p)$ that is non-negative everywhere and that integrates (over the p-dimensional space) to unity. Means, variances, and covariances can be determined from the joint distribution. For instance, for continuous distributions the covariance between $y_1$ (with mean $\mu_1$) and $y_2$ (with mean $\mu_2$) is given by $\sigma_{12} = \mathrm{cov}(y_1, y_2) = E[(y_1 - \mu_1)(y_2 - \mu_2)]$, that is, the p-dimensional integral

$$\sigma_{12} = \int \cdots \int (v_1 - \mu_1)(v_2 - \mu_2)\,f(v_1, \ldots, v_p)\,dv_1 \cdots dv_p.$$

These variances and covariances can be collected in the p × p symmetric covariance matrix

$$\Sigma = \begin{pmatrix} \mathrm{var}(y_1) & \mathrm{cov}(y_1, y_2) & \cdots & \mathrm{cov}(y_1, y_p) \\ \mathrm{cov}(y_2, y_1) & \mathrm{var}(y_2) & \cdots & \mathrm{cov}(y_2, y_p) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{cov}(y_p, y_1) & \mathrm{cov}(y_p, y_2) & \cdots & \mathrm{var}(y_p) \end{pmatrix} = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{p1} & \sigma_{p2} & \cdots & \sigma_{pp} \end{pmatrix}.$$

The correlation matrix is defined in an analogous way, replacing the elements $\sigma_{ij}$ in $\Sigma$ by the correlations $\rho_{ij} = \frac{\sigma_{ij}}{\sqrt{\sigma_{ii}\sigma_{jj}}}$. The variables are independent if and only if the joint density f is equal to the product of the p individual marginal densities $f_{y_i}$ of $y_i$, that is,

$$f(v_1, \ldots, v_p) = \prod_{i=1}^{p} f_{y_i}(v_i).$$

Independent variables are uncorrelated, so that in this case $\sigma_{ij} = 0$ for all $i \neq j$. If in addition all the variables have equal variance $\sigma_{ii} = \sigma^2$, then the covariance matrix is of the form $\Sigma = \sigma^2 I$ where I is the p × p identity matrix.

Linear  transformations  of  random  variables 

For  our  statistical  analysis  in  later  chapters  we  now  consider  the  distribution 
of  functions  of  random  variables.  For  linear  transformations  the  first  and 
second  moments  of  the  transformed  variables  can  be  determined  in  a  simple 
way. Let $y_1, \ldots, y_p$ be given random variables and let $z = b + \sum_{j=1}^{p} a_j y_j$ be a linear function of these random variables, for given (non-random) constants b and $a_j$. Then the mean and variance of z are given by

$$E[z] = b + \sum_{j=1}^{p} a_j E[y_j], \qquad \mathrm{var}(z) = \sum_{j=1}^{p} \sum_{k=1}^{p} a_j a_k\,\mathrm{cov}(y_j, y_k)$$

where $\mathrm{cov}(y_j, y_j) = \mathrm{var}(y_j)$. When the random variables $y_j$ are uncorrelated and have identical mean $\mu$ and variance $\sigma^2$, it follows that $E[z] = b + \mu \sum a_j$ and $\mathrm{var}(z) = \sigma^2 \sum a_j^2$. For instance, when $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$ is the mean of n such uncorrelated random variables, then it follows that

$$E[\bar{y}] = \mu, \qquad \mathrm{var}(\bar{y}) = \frac{\sigma^2}{n}. \qquad (1.17)$$

Now let $z = Ay + b$ be a vector of random variables, where A and b are a given (non-random) m × p matrix and m × 1 vector respectively and where y is a p × 1 vector of random variables with vector of means $\mu$ and covariance matrix $\Sigma$. Then the vector of means of z and its covariance matrix $\Sigma_z$ are given by

$$E[z] = A\mu + b, \qquad \Sigma_z = A \Sigma A' \qquad (1.18)$$

where $A'$ denotes the transpose of the matrix A (see Exercise 1.3).
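The rule (1.18) can be verified exactly on a finite population. The sketch below assumes a hypothetical population of four equally likely outcome vectors and an arbitrary matrix A and vector b (all numbers are made up); the helper functions are written out in plain Python so that the example has no dependencies.

```python
def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def mean_vector(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def cov_matrix(rows):
    # Population covariance matrix (divisor n) of equally likely outcome vectors.
    n = len(rows)
    m = mean_vector(rows)
    dev = [[x - mj for x, mj in zip(row, m)] for row in rows]
    return [[c / n for c in r] for r in matmul(transpose(dev), dev)]

# Hypothetical population: four equally likely outcomes of a 3 x 1 vector y.
y_outcomes = [[1.0, 2.0, 0.0], [2.0, 1.0, 1.0], [0.0, 3.0, 2.0], [3.0, 0.0, 1.0]]
A = [[1.0, -1.0, 0.5], [2.0, 0.0, 1.0]]   # given 2 x 3 matrix
b = [0.5, -1.0]                            # given 2 x 1 vector

# Transform every outcome: z = A y + b.
z_outcomes = [[sum(a * yj for a, yj in zip(row, y)) + bi for row, bi in zip(A, b)]
              for y in y_outcomes]

mu = mean_vector(y_outcomes)
sigma = cov_matrix(y_outcomes)

# Right-hand sides of (1.18).
mean_z_formula = [sum(a * m for a, m in zip(row, mu)) + bi for row, bi in zip(A, b)]
cov_z_formula = matmul(matmul(A, sigma), transpose(A))

# Left-hand sides, computed directly from the transformed outcomes.
mean_z_direct = mean_vector(z_outcomes)
cov_z_direct = cov_matrix(z_outcomes)

print(mean_z_formula, mean_z_direct)
print(cov_z_formula, cov_z_direct)
```

Both routes give the same mean vector and covariance matrix up to floating-point rounding, and the shift b does not affect the covariance matrix.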

Arbitrary  transformations  of  random  variables 

The  distribution  of  non-linear  functions  of  random  variables  can  be 
derived from the joint distribution of these variables. For example, let $z_1 = g_1(y_1, y_2)$ and $z_2 = g_2(y_1, y_2)$ be two functions of given random variables $y_1$ and $y_2$. Suppose that the mapping $g = (g_1, g_2)$ from $(y_1, y_2)$ to $(z_1, z_2)$ is invertible with inverse $h = (h_1, h_2)$. The Jacobian J is defined as the determinant of the 2 × 2 matrix with elements $\partial h_i / \partial z_j$ for i, j = 1, 2. For discrete random variables, the distribution of $(z_1, z_2)$ is given by $P[z_1 = w_1, z_2 = w_2] = P[y_1 = v_1, y_2 = v_2]$ where $(v_1, v_2) = h(w_1, w_2)$. For continuous random variables, the joint density function of $(z_1, z_2)$ is given by

$$f_{z_1, z_2}(w_1, w_2) = f_{y_1, y_2}(h(w_1, w_2))\,|J(w_1, w_2)|. \qquad (1.19)$$

That is, the density of $(y_1, y_2)$ should be evaluated at the point $h(w_1, w_2)$ and the result should be multiplied by the absolute value of the Jacobian J in $(w_1, w_2)$. This result generalizes to the case of more than two functions. When $z_1 = g_1(y_1)$ and $z_2 = g_2(y_2)$ and $y_1$ and $y_2$ are independent, then it follows from (1.10) and (1.19) that $z_1$ and $z_2$ are also independent (see Exercise 1.3). So in this case $z_1$ and $z_2$ are uncorrelated, so that $E[g_1(y_1) g_2(y_2)] = E[g_1(y_1)]\,E[g_2(y_2)]$ when $y_1$ and $y_2$ are independent. If $y_1$ and $y_2$ are uncorrelated but not independent, then $z_1$ and $z_2$ are in general not uncorrelated, unless $g_1$ and $g_2$ are linear functions (see Exercise 1.3).
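As a small check of the change-of-variables formula (1.19), the sketch below takes a transformation chosen purely for illustration: $z_1 = y_1 + y_2$ and $z_2 = y_1 - y_2$, with $y_1$ and $y_2$ assumed to be independent standard normal (the normal density is used here ahead of its formal introduction in the text). The cross-check relies on the standard result that the sum and difference are then independent with variance 2.

```python
import math

def normal_pdf(v, mean=0.0, var=1.0):
    return math.exp(-(v - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def joint_density_y(v1, v2):
    # y1 and y2 independent standard normal (an assumption made for this sketch).
    return normal_pdf(v1) * normal_pdf(v2)

def h(w1, w2):
    # Inverse of g(y1, y2) = (y1 + y2, y1 - y2).
    return (w1 + w2) / 2.0, (w1 - w2) / 2.0

# Jacobian: determinant of the matrix of partial derivatives of h, which are
# dh1/dw1 = 1/2, dh1/dw2 = 1/2, dh2/dw1 = 1/2, dh2/dw2 = -1/2.
jacobian = 0.5 * (-0.5) - 0.5 * 0.5      # equals -1/2

def joint_density_z(w1, w2):
    # Formula (1.19): density of y evaluated at h(w), times |J|.
    return joint_density_y(*h(w1, w2)) * abs(jacobian)

def product_of_marginals(w1, w2):
    # Cross-check: z1 and z2 are independent N(0, 2) in this special case.
    return normal_pdf(w1, var=2.0) * normal_pdf(w2, var=2.0)

points = [(0.0, 0.0), (1.0, -0.5), (-2.0, 1.5)]
max_gap = max(abs(joint_density_z(*pt) - product_of_marginals(*pt)) for pt in points)
print(max_gap)   # zero up to floating-point rounding
```

Because the transformation is linear, the Jacobian is constant here; for a non-linear g it would depend on the point $(w_1, w_2)$.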
Example 1.6: Student Learning (continued)

As  an  illustration  we  consider  again  the  data  on  student  learning  of  609 
students.  In  this  example  we  will  consider  these  609  students  as  the  popula¬ 
tion  of  interest  and  we  will  analyse  the  effect  of  the  gender  of  the  student  by 
conditioning  with  respect  to  this  variable.  Exhibit  1.6  shows  histograms  of 
the  variable  FGPA  for  male  students  ( a )  and  female  students  (b)  separately. 
Of  the  609  students  in  the  population,  373  are  male  and  236  are  female.  The 
two  means  and  standard  deviations  in  Exhibit  1.6  are  conditional  on  the 
gender  of  the  student  and  they  differ  in  the  two  groups.  The  mean  and 
standard  deviation  of  the  unconditional  (full)  population  are  in  Exhibit  1.4 
(a).  The  relations  (1.15)  and  (1.16)  (more  precisely,  their  analogue  for  the 
current  discrete  distributions)  are  easily  verified,  using  the  fact  that  the 
conditioning  variable  x  in  this  case  is  a  discrete  random  variable  with 
probabilities  373/609  for  a  male  and  236/609  for  a  female  student.  Indeed, 
denoting  males  by  M  and  females  by  F,  we  can  verify  the  result  (1.15)  for  the 
mean  because 


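$$E[E[y \mid x]] = \frac{373}{609} E[y \mid M] + \frac{236}{609} E[y \mid F] = \frac{373}{609}(2.728) + \frac{236}{609}(2.895) = 2.793 = E[y],$$

and we can verify the result (1.16) for the variance because

$$E[\mathrm{var}(y \mid x)] = \frac{373}{609}\mathrm{var}(y \mid M) + \frac{236}{609}\mathrm{var}(y \mid F) = \frac{373}{609}(0.441)^2 + \frac{236}{609}(0.472)^2 = 0.206 < 0.212 = (0.460)^2 = \mathrm{var}(y).$$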
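The arithmetic of this example can be reproduced from the group statistics reported in Exhibit 1.6. The sketch below is only a numerical check; it also evaluates the variance of the conditional means, using the standard decomposition var(y) = E[var(y|x)] + var(E[y|x]), which is not stated in the text but accounts for the gap between 0.206 and 0.212. Small discrepancies can arise because the reported standard deviations are rounded.

```python
# Group sizes, means, and standard deviations as reported in Exhibit 1.6.
n_m, n_f = 373, 236
mean_m, mean_f = 2.728239, 2.894831
sd_m, sd_f = 0.441261, 0.471943

# Probabilities of the conditioning variable (gender) in the population of 609.
p_m = n_m / (n_m + n_f)
p_f = n_f / (n_m + n_f)

# Mean of the conditional means, as in (1.15).
mean_y = p_m * mean_m + p_f * mean_f

# Mean of the conditional variances, the left-hand side of (1.16).
mean_cond_var = p_m * sd_m ** 2 + p_f * sd_f ** 2

# Variance of the conditional means (between-group part of the variance).
var_cond_mean = p_m * (mean_m - mean_y) ** 2 + p_f * (mean_f - mean_y) ** 2

print(round(mean_y, 3))                         # 2.793
print(round(mean_cond_var, 3))                  # 0.206
print(round(mean_cond_var + var_cond_mean, 3))  # 0.212
```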


Exhibit 1.6 Student Learning (Example 1.6)

Histograms for FGPA scores of males (a) and females (b), with the following summary statistics.

Series: FGPA      (a) Males    (b) Females
Sample            2 608        1 609
Observations      373          236
Mean              2.728239     2.894831
Median            2.688000     2.896500
Maximum           3.948000     3.971000
Minimum           1.500000     1.805000
Std. Dev.         0.441261     0.471943
Skewness          0.217720     0.034324
Kurtosis          2.658253     2.353326


Exercises: T: 1.2, 1.3b-d; E: 1.11e.

1.2.3 Probability distributions

First used in Section 2.2.3; uses Appendix A.2-A.5.


Bernoulli  distribution  and  binomial  distribution 

In  this  section  we  consider  some  probability  distributions  that  are  often  used 
in  econometrics.  The  simplest  case  of  a  random  variable  is  a  discrete  variable 
y  with  only  two  possible  outcomes,  denoted  by  0  (failure)  and  1  (success). 
The  probability  distribution  is  completely  described  by  the  probability 
$p = P[y = 1]$ of success, as $P[y = 0] = 1 - P[y = 1] = 1 - p$. This is called the Bernoulli distribution. It has mean p and variance $p(1 - p)$ (see Exercise 1.4). Suppose that the n random variables $y_i$, $i = 1, \ldots, n$, are independent and identically distributed, with the Bernoulli distribution with probability p of success. Let $y = \sum_{i=1}^{n} y_i$ be the total number of successes. The set of possible outcome values of y is $V = \{0, 1, \ldots, n\}$, and

$$P[y = v] = \binom{n}{v} p^v (1 - p)^{n - v}$$

(the first term, 'n over v', is the number of possibilities to locate v successes over n positions). This is called the binomial distribution. It has mean np and variance $np(1 - p)$ (see Exercise 1.4).
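The binomial probabilities, and the stated mean and variance, are easy to confirm numerically. The sketch below uses illustrative values n = 10 and p = 0.3, which are assumptions and not taken from the text.

```python
from math import comb

def binomial_pmf(v, n, p):
    # P[y = v] = (n over v) * p^v * (1 - p)^(n - v)
    return comb(n, v) * p ** v * (1 - p) ** (n - v)

n, p = 10, 0.3   # illustrative values
pmf = [binomial_pmf(v, n, p) for v in range(n + 1)]

total = sum(pmf)                                            # probabilities sum to 1
mean = sum(v * q for v, q in enumerate(pmf))                # should equal n * p
variance = sum((v - mean) ** 2 * q for v, q in enumerate(pmf))  # should equal n * p * (1 - p)

print(total, mean, variance)
```

Setting n = 1 reduces the same function to the Bernoulli distribution, with probabilities 1 - p and p.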


Normal  distribution 

The  normal  distribution  is  the  most  widely  used  distribution  in  econometrics. 
One  of  the  reasons  is  the  central  limit  theorem  (to  be  discussed  later),  which 
says  that  many  distributions  can  be  approximated  by  normal  distributions  if 
the  sample  size  is  large  enough.  Another  reason  is  that  the  normal  distribu¬ 
tion  has  a  number  of  attractive  properties.  A  normal  random  variable  is  a 
continuous  random  variable  that  can  take  on  any  value.  Its  density  function 
is  given  by 


$$f(v) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(v - \mu)^2}{2\sigma^2}}, \qquad -\infty < v < \infty. \qquad (1.20)$$

This function is symmetric around $\mu$ and it is shaped like a bell (see Exhibit 1.7). The distribution contains two parameters, $\mu$ and $\sigma^2$, and the distribution is denoted by $N(\mu, \sigma^2)$. This notation is motivated by the fact that $\mu$ is the mean and $\sigma^2$ the variance of this distribution. The third and fourth moments of this distribution are 0 and $3\sigma^4$ respectively (see Exercise 1.4), so that the skewness is zero and the kurtosis is equal to 3. As the normal distribution is often taken as a benchmark, distributions with kurtosis larger than three are called fat-tailed.

Exhibit 1.7 Normal distribution

Density functions of two normal distributions, one with mean 0 and variance 1 (a) and another one with mean 3 and variance 2 (b). The plot in (c) shows the two densities in one diagram for comparison.

When y follows the $N(\mu, \sigma^2)$ distribution, this is written as $y \sim N(\mu, \sigma^2)$. The result in (1.10) implies (see Exercise 1.4) that the linear function $ay + b$ (with a and b fixed numbers) is also normally distributed and

$$ay + b \sim N(a\mu + b,\ a^2\sigma^2).$$

In particular, when y is standardized by subtracting its mean and dividing by its standard deviation, it follows that

$$\frac{y - \mu}{\sigma} \sim N(0, 1).$$

This is called the standard normal distribution. Its density function is denoted by $\phi$, so that

$$\phi(v) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}v^2}$$

and the cumulative distribution function is denoted by $\Phi(v) = \int_{-\infty}^{v} \phi(u)\,du$.

Multivariate  normal  distribution 

In  later  chapters  we  will  often  consider  jointly  normally  distributed  random 
variables.  It  is  very  convenient  to  use  matrix  notation  to  describe  the  multi¬ 
variate  normal  distribution.  The  multivariate  normal  distribution  of  n 
random  variables  has  density  function 


$$f(v) = \frac{1}{(2\pi)^{n/2}(\det(\Sigma))^{1/2}}\, e^{-\frac{1}{2}(v - \mu)'\Sigma^{-1}(v - \mu)} \qquad (1.21)$$

where v denotes the n variables, $\mu$ is an n × 1 vector, and $\Sigma$ an n × n positive definite matrix ($\det(\Sigma)$ denotes the determinant of this matrix). This notation is motivated by the fact that this distribution has mean $\mu$ and covariance matrix $\Sigma$. The distribution is written as $N(\mu, \Sigma)$.


Properties  of  the  multivariate  normal  distribution 

Marginal  and  conditional  distributions  of  normal  distributions  remain 
normal. If $y \sim N(\mu, \Sigma)$, then the ith component $y_i$ is also normally distributed and $y_i \sim N(\mu_i, \sigma_{ii})$ where $\mu_i$ is the ith component of $\mu$ and $\sigma_{ii}$ the ith diagonal element of $\Sigma$. For the conditional distribution, let the vector y be split in two parts (with sub-vectors $y_1$ and $y_2$) and let the mean vector and covariance matrix be split accordingly. Then the conditional distribution of $y_1$, given that $y_2 = v_2$, is given by

$$y_1 \mid y_2 = v_2 \sim N\!\left(\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(v_2 - \mu_2),\ \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right) \qquad (1.22)$$

where $\Sigma_{11}$ is the covariance matrix of $y_1$, $\Sigma_{22}$ is the covariance matrix of $y_2$, $\Sigma_{12}$ is the covariance matrix between $y_1$ and $y_2$, and $\Sigma_{21}$ is the transpose of $\Sigma_{12}$ (see Exercise 1.4). Note that the conditional variance does not depend on the value of $y_2$ in this case. That is, knowledge of the value of $y_2$ always leads to the same reduction in the uncertainty of $y_1$ if the variables are normally distributed.

For arbitrary random variables, independence implies being uncorrelated but not the other way round. However, when jointly normally distributed variables are uncorrelated, so that $\Sigma$ is a diagonal matrix, then the joint density (1.21) reduces to the product of the individual densities. That is, when normally distributed variables are uncorrelated they are also independent. This also follows from (1.22), as $\Sigma_{12} = 0$ if $y_1$ and $y_2$ are uncorrelated so that the conditional distribution of $y_1$ becomes independent of $y_2$.

If the n × 1 vector y is normally distributed, then the linear function $Ay + b$ (with A a given m × n matrix and b a given m × 1 vector) is also normally distributed and (see Exercise 1.4)

$$Ay + b \sim N(A\mu + b,\ A\Sigma A'). \qquad (1.23)$$
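For two jointly normal variables the blocks in (1.22) are scalars, which makes the conditional distribution easy to check against the definition (1.13). The sketch below uses made-up parameter values (an assumption for illustration) and compares the ratio of the joint density (1.21) to the marginal density with the normal density implied by (1.22).

```python
import math

# Hypothetical parameters of a bivariate normal distribution.
mu1, mu2 = 1.0, -2.0
s11, s12, s22 = 4.0, 1.5, 1.0   # requires s11 * s22 - s12**2 > 0

def normal_pdf(v, mean, var):
    return math.exp(-(v - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def bivariate_normal_pdf(v1, v2):
    # Density (1.21) for n = 2, with the 2 x 2 inverse written out explicitly.
    det = s11 * s22 - s12 ** 2
    d1, d2 = v1 - mu1, v2 - mu2
    quad = (s22 * d1 * d1 - 2.0 * s12 * d1 * d2 + s11 * d2 * d2) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))

def conditional_moments(v2):
    # Scalar version of (1.22): mean and variance of y1 given y2 = v2.
    mean = mu1 + s12 / s22 * (v2 - mu2)
    var = s11 - s12 * s12 / s22
    return mean, var

v1, v2 = 0.3, -1.2
cond_mean, cond_var = conditional_moments(v2)

# Definition (1.13): conditional density = joint density / marginal density of y2.
ratio = bivariate_normal_pdf(v1, v2) / normal_pdf(v2, mu2, s22)
direct = normal_pdf(v1, cond_mean, cond_var)

print(cond_mean, cond_var)
print(ratio, direct)   # equal up to floating-point rounding
```

The conditional variance (1.75 here) is the same for every value of v2, in line with the remark following (1.22).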


Chi-square ($\chi^2$) distribution

In the rest of this section we consider the distribution of some other functions of normally distributed random variables that will be used later on. Suppose that $y_1, \ldots, y_n$ are independent and all follow the standard normal distribution. Then the distribution of the sum of squares $\sum_{i=1}^{n} y_i^2$ is called the chi-square distribution with n degrees of freedom, denoted by $\chi^2(n)$. This can be generalized to other quadratic forms in the vector of random variables $y = (y_1, \ldots, y_n)'$. Let A be an n × n matrix that is symmetric (that is, $A' = A$), idempotent (that is, $A^2 = A$) and that has rank r (which in this case is equal to the trace of A, that is, the sum of the n diagonal elements of this matrix). Then

$$y'Ay \sim \chi^2(r) \qquad (1.24)$$

(see Exercise 1.5). For a symmetric idempotent matrix A there always holds that $y'Ay \ge 0$. The density of the $\chi^2(r)$ distribution is given by

$$f(v) \propto v^{\frac{r}{2} - 1} e^{-\frac{1}{2}v}, \quad v > 0; \qquad f(v) = 0, \quad v \le 0, \qquad (1.25)$$

where $\propto$ means 'proportional to', that is, f(v) is equal to the given expression up to a scaling constant that does not depend on v. This scaling constant is defined by the condition that $\int f(v)\,dv = 1$. The $\chi^2(r)$ distribution has mean r and variance 2r (see Exercise 1.5). Exhibit 1.8 shows chi-square densities for varying degrees of freedom. The distributions have a positive skewness.

Exhibit 1.8 $\chi^2$-distribution

Density functions of two chi-squared distributions, one with 4 degrees of freedom (a) and another one with 8 degrees of freedom (b). The plot in (c) shows the two densities in one diagram for comparison.

Student t-distribution

If $y_1 \sim N(0, 1)$ and $y_2 \sim \chi^2(r)$ and $y_1$ and $y_2$ are independently distributed, then the distribution of $y_1 / \sqrt{y_2 / r}$ is called the Student t-distribution with r degrees of freedom, written as

$$\frac{y_1}{\sqrt{y_2 / r}} \sim t(r). \qquad (1.26)$$

Up to a scaling constant, the density of the t(r)-distribution is given by

$$f(v) \propto \frac{1}{\left(1 + \frac{v^2}{r}\right)^{\frac{r+1}{2}}}, \qquad -\infty < v < \infty. \qquad (1.27)$$

For r > 1 the mean is equal to 0, and for r > 2 the variance is equal to $\frac{r}{r - 2}$. Exhibit 1.9 shows t-distributions for varying degrees of freedom. These distributions are symmetric (the skewness is zero) and have fat tails (the kurtosis is larger than three). For r = 1, the t(1)-distribution (also called the Cauchy distribution) has density

$$f(v) = \frac{1}{\pi(1 + v^2)}.$$

This distribution is so dispersed that it does not have finite moments; in particular, the mean and the variance do not exist. On the other hand, if $r \to \infty$ then the t(r) density converges to the standard normal density (see Exercise 1.5).
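The two limiting cases of the t-distribution can be illustrated numerically. The sketch below supplies the scaling constant of (1.27) from the standard formula $\Gamma((r+1)/2) / (\sqrt{r\pi}\,\Gamma(r/2))$, which is not given in the text and is taken here as a known result; the evaluation grid is arbitrary.

```python
import math

def t_pdf(v, r):
    # Density (1.27) including its scaling constant (standard result, assumed here):
    # Gamma((r + 1) / 2) / (sqrt(r * pi) * Gamma(r / 2)).
    log_c = math.lgamma((r + 1) / 2) - math.lgamma(r / 2) - 0.5 * math.log(r * math.pi)
    return math.exp(log_c - (r + 1) / 2 * math.log1p(v * v / r))

def cauchy_pdf(v):
    return 1.0 / (math.pi * (1.0 + v * v))

def std_normal_pdf(v):
    return math.exp(-0.5 * v * v) / math.sqrt(2.0 * math.pi)

grid = [-3.0, -1.0, 0.0, 0.5, 2.0]   # arbitrary evaluation points

# r = 1: the t(1) density coincides with the Cauchy density.
gap_cauchy = max(abs(t_pdf(v, 1) - cauchy_pdf(v)) for v in grid)

# Growing r: the largest gap to the standard normal density shrinks.
gap_normal = {r: max(abs(t_pdf(v, r) - std_normal_pdf(v)) for v in grid)
              for r in (4, 100, 10000)}

print(gap_cauchy)
print(gap_normal)
```

Working with logarithms of the gamma function avoids overflow for large r, which is why `math.lgamma` is used rather than `math.gamma`.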


Exhibit 1.9 t-distribution

Density functions of three t-distributions, with number of degrees of freedom equal to 1 (a), 4 (b), and 100 (c). The plot in (d) shows the three densities in one diagram for comparison. For more degrees of freedom the density is more concentrated around zero and has less fat tails.

F-distribution

If $y_1 \sim \chi^2(r_1)$ and $y_2 \sim \chi^2(r_2)$ and $y_1$ and $y_2$ are independently distributed, then the distribution of $(y_1 / r_1) / (y_2 / r_2)$ is called the F-distribution with $r_1$ and $r_2$ degrees of freedom. This is written as

$$\frac{y_1 / r_1}{y_2 / r_2} \sim F(r_1, r_2). \qquad (1.28)$$

Exhibit 1.10 shows F-distributions for varying degrees of freedom. If $r_2 \to \infty$, then $r_1 \cdot F(r_1, r_2)$ converges to the $\chi^2(r_1)$-distribution (see Exercise 1.5).

Conditions  for  independence 

In connection with the t- and F-distributions, it is helpful for later purposes to have simple checks for the independence between linear and quadratic forms of normally distributed random variables. Let $y \sim N(0, I)$ be a vector of independent standard normal random variables, and let $z_0 = Ay$, $z_1 = y'Q_1 y$ and $z_2 = y'Q_2 y$ be respectively a linear form (with A an m × n matrix) and two quadratic forms (with $Q_1$ and $Q_2$ symmetric and idempotent n × n matrices). The two following results are left as an exercise (see Exercise 1.5). The random variables $z_0$ (with normal distribution) and $z_1$ (with $\chi^2$-distribution) are independently distributed if

$$AQ_1 = 0, \qquad (1.29)$$

and the random variables $z_1$ and $z_2$ (both with $\chi^2$-distribution) are independently distributed if

$$Q_1 Q_2 = 0. \qquad (1.30)$$

Exhibit 1.10 F-distribution

Density functions of three F-distributions, with numbers of degrees of freedom in numerator and denominator respectively (4,4) (a), (4,100) (b), and (100,4) (c). The plot in (d) shows the three densities in one diagram for comparison. For more degrees of freedom in the numerator the density shifts more to the right, and for more degrees of freedom in the denominator it gets less fat tails.

Exercises: T: 1.4, 1.5a-e, 1.13f, 1.15b.


1.2.4 Normal random samples

First used in Section 2.2.3; uses Appendix A.2-A.5.


To illustrate some of the foregoing results, we consider the situation where $y_1, \ldots, y_n$ are normally and independently distributed random variables with the same mean $\mu$ and variance $\sigma^2$. This is written as

$$y_i \sim \mathrm{NID}(\mu, \sigma^2), \qquad i = 1, \ldots, n, \qquad (1.31)$$

where NID stands for normally and independently distributed. One also says that $y_1, \ldots, y_n$ is a random sample (that is, with independent observations) from $N(\mu, \sigma^2)$. We are interested in the distributions of the sample mean $\bar{y}$ in (1.1) and of the sample variance $s^2$ in (1.3).

Sample  mean 

Let y be the n × 1 vector with elements $y_1, \ldots, y_n$, so that

$$y \sim N(\mu\iota, \sigma^2 I) \qquad (1.32)$$

where $\iota$ is the n × 1 vector with all its elements equal to 1 and I is the n × n identity matrix. The sample mean is given by $\bar{y} = \frac{1}{n}\sum y_i = \frac{1}{n}\iota' y$, and as $\iota'\iota = n$ and $\iota' I \iota = n$ it follows from (1.23) that

$$\bar{y} = \frac{1}{n}\iota' y \sim N\!\left(\mu, \frac{\sigma^2}{n}\right). \qquad (1.33)$$


Sample  variance 

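To derive the distribution of the sample variance $s^2$, let

$$z_i = (y_i - \mu)/\sigma$$

so that $z_i \sim \mathrm{NID}(0, 1)$. Then $s^2$ can be written as $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2 = \frac{\sigma^2}{n-1}\sum_{i=1}^{n}(z_i - \bar{z})^2$. Now $\sum_{i=1}^{n}(z_i - \bar{z})^2 = z'Mz$ where the matrix M is defined by

$$M = I - \frac{1}{n}\iota\iota'. \qquad (1.34)$$

The matrix M is symmetric and idempotent and has rank n - 1 (see Exercise 1.5). Then (1.24) shows that

$$\frac{(n-1)s^2}{\sigma^2} = z'Mz \sim \chi^2(n - 1). \qquad (1.35)$$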
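The stated properties of the matrix M in (1.34) can be checked directly for a small n. The sketch below uses n = 5 and an arbitrary vector z (both are assumptions for illustration) and confirms that M is symmetric and idempotent with trace n - 1, that z'Mz equals the sum of squared deviations from the mean, and that the columns of M sum to zero, which is the property $\iota' M = 0$.

```python
n = 5
# M = I - (1/n) * iota * iota', written out element by element.
M = [[(1.0 if i == j else 0.0) - 1.0 / n for j in range(n)] for i in range(n)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

MM = matmul(M, M)
symmetric = all(M[i][j] == M[j][i] for i in range(n) for j in range(n))
idempotent = all(abs(MM[i][j] - M[i][j]) < 1e-12 for i in range(n) for j in range(n))
trace = sum(M[i][i] for i in range(n))   # equals the rank, n - 1

z = [0.4, -1.3, 2.0, 0.1, -0.7]          # arbitrary illustrative vector
z_bar = sum(z) / n
quad_form = sum(z[i] * M[i][j] * z[j] for i in range(n) for j in range(n))
sum_sq_dev = sum((zi - z_bar) ** 2 for zi in z)

# iota' M: the column sums of M, all zero.
col_sums = [sum(M[i][j] for i in range(n)) for j in range(n)]

print(symmetric, idempotent, trace)
print(quad_form, sum_sq_dev)
print(col_sums)
```

Multiplying a vector by M subtracts the mean from each element, which is why applying M twice changes nothing further.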


The  t-value  of  the  sample  mean 

Using the notation introduced above, the result (1.33) implies that $\frac{1}{\sqrt{n}}\iota' z = \sqrt{n}\,\bar{z} = \sqrt{n}(\bar{y} - \mu)/\sigma \sim N(0, 1)$. As $\iota' M = 0$, it follows from (1.29) that this standard normal random variable is independent from the $\chi^2(n - 1)$ random variable in (1.35). By definition,

$$\frac{\sqrt{n}(\bar{y} - \mu)/\sigma}{\sqrt{\frac{(n-1)s^2/\sigma^2}{n - 1}}} = \frac{\bar{y} - \mu}{s/\sqrt{n}} \sim t(n - 1). \qquad (1.36)$$

Note that the random variable in (1.36) has a distribution that does not depend on $\sigma^2$. Such a random variable (in this case, a function of the data and of the parameter $\mu$ that does not depend on $\sigma^2$) is called pivotal for the parameter $\mu$. The result in (1.35) shows that $(n - 1)s^2/\sigma^2$ is pivotal for $\sigma^2$. Such pivotal random variables are helpful in statistical hypothesis testing, as will become clear in Section 1.4.2.

If it is assumed that the population mean is zero, that is, $\mu = 0$, it follows that

$$\frac{\bar{y}}{s/\sqrt{n}} \sim t(n - 1). \qquad (1.37)$$

This is called the t-value of the sample mean.

Exercises: T: 1.5f, 1.15a.

1.3 Parameter estimation


1.3.1  Estimation  methods 

First  used  in  Section  4.3.2;  uses  Appendix  A.l,  A.7. 


Concepts:  model,  parameters,  estimator,  estimate 

Suppose that n available observations $y_i$, $i = 1, \ldots, n$ are considered as the outcomes of random variables with a joint probability distribution $f_\theta(y_1, \ldots, y_n)$. Here it is assumed that the general shape of the distribution is known up to one or more unknown parameters that are denoted by $\theta$. A set of distributions $\{f_\theta;\ \theta \in \Theta\}$ is called a model for the observations; that is, it specifies the general shape of the distribution together with a set $\Theta$ of possible values for the unknown parameters. The numerical values of $\theta$ are unknown, but they can be estimated from the observed data. Estimated parameters are denoted by $\hat{\theta}$. As an example, if it is supposed that $y_i \sim \mathrm{NID}(\mu, \sigma^2)$ with unknown mean $\mu$ and variance $\sigma^2$, then the joint distribution is given by (1.32) with parameter set $\Theta = \{(\mu, \sigma^2);\ \sigma^2 > 0\}$. The parameters can be estimated, for instance, by the sample mean and sample variance discussed in Section 1.1.2.

In this section we consider a general framework for estimation with corresponding concepts and terminology that are used throughout this book. In all that follows, we use the notation $y_i$ both for the random variable and for the observed outcome of this variable. A statistic is any given function $g(y_1, \ldots, y_n)$, that is, any numerical expression that can be evaluated from the observed data alone. An estimator is a statistic that is used to make a guess about an unknown parameter. For instance, the sample mean (1.1) is a statistic that provides an intuitively appealing guess for the population mean $\mu$. An estimator is a random variable, as it depends on the random variables $y_i$. For given observed outcomes, the resulting numerical value of the estimator is called the estimate of the parameter. So an estimator is a numerical expression in terms of random variables, and an estimate is a number. Several methods have been developed for the construction of estimators. We discuss three methods: the method of moments, least squares, and maximum likelihood.


The  method  of  moments 

In the method of moments the parameters are estimated as follows. Suppose that $\theta$ contains k unknown parameters. The specified model (that is, the general shape of the distribution) implies expressions for the population moments in terms of $\theta$. If k such moments are selected, the parameters $\theta$ can in general be solved from these k expressions. Now $\theta$ is estimated by replacing the unknown population moments by the corresponding sample moments. An advantage of this method is that it is based on moments that are often easy to compute. However, it should be noted that the obtained estimates depend on the chosen moments.

Example  1 .7:  Student  Learning  (continued) 

To  illustrate  the  method  of  moments,  we  consider  the  FGPA  scores  of  609 
students  in  Example  1.4.  Summary  statistics  of  this  sample  are  in  Exhibit  1.4 
(a),  with  mean  y  =  2.793,  standard  deviation  s  =  0.460,  skewness  0.168, 
and  kurtosis  2.511.  So  the  first  moment  is  2.793  and  the  second  moment 
(1.2)  is  equal  to  mi  =  (n  —  1  )s2/n  =  0.211.  If  these  scores  are  assumed  to  be 
normally  and  independently  distributed  with  mean  and  variance  a2,  the 
first  and  second  moment  of  this  distribution  are  equal  to  /t  and  a2  respect¬ 
ively.  So  the  moment  estimates  then  become  jx  =  2.793  and  a2  =  0.211. 

Instead of using the second moment, one could also use the fourth moment
to estimate $\sigma^2$. The fourth (population) moment of the normal distribution is
equal to $3\sigma^4$. To obtain the fourth sample moment from the summary
statistics presented in Exhibit 1.4 (a), note that the sample kurtosis ($K$) is
equal to the sample fourth moment ($m_4$) divided by $s^4$, so that
$m_4 = Ks^4 = 2.511(0.460)^4 = 0.112$. The estimate $\tilde{\sigma}^2$ of the parameter $\sigma^2$
based on the fourth moment is then obtained by solving $3\tilde{\sigma}^4 = m_4$, so that
$\tilde{\sigma}^2 = \sqrt{m_4/3} = 0.194$. The above results show that the parameter estimates
may be different for different choices of the fitted moments. In our example
the differences are not so large.
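
The arithmetic of this example can be retraced with a few lines of Python. This is only a sketch: the inputs are the rounded summary statistics quoted above, and the variable names are chosen here for illustration.

```python
import math

# Summary statistics quoted in the example (Exhibit 1.4 (a)).
n = 609            # number of students
s = 0.460          # sample standard deviation
kurtosis = 2.511   # sample kurtosis K

# Estimate of sigma^2 based on the second moment: m2 = (n - 1) s^2 / n.
m2 = (n - 1) * s**2 / n
sigma2_from_m2 = m2

# Estimate of sigma^2 based on the fourth moment: m4 = K s^4, solve 3 sigma^4 = m4.
m4 = kurtosis * s**4
sigma2_from_m4 = math.sqrt(m4 / 3)

print(f"second-moment estimate: {sigma2_from_m2:.3f}")
print(f"fourth-moment estimate: {sigma2_from_m4:.3f}")
```

Both estimates target the same parameter; they differ only because a different moment condition is fitted.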

Least  squares 

Another method for parameter estimation is least squares. We illustrate this
method for the estimation of the population mean from a random sample
$y_1, \ldots, y_n$ of a distribution with unknown mean $\mu$ and unknown variance $\sigma^2$.
Let $\varepsilon_i = y_i - \mu$; then it follows that $\varepsilon_1, \ldots, \varepsilon_n$ are identically and
independently distributed with mean zero and variance $\sigma^2$. This is written as
$\varepsilon_i \sim \mathrm{IID}(0, \sigma^2)$. The model can now be written as

$$y_i = \mu + \varepsilon_i, \qquad \varepsilon_i \sim \mathrm{IID}(0, \sigma^2). \tag{1.38}$$

The least squares estimate is that value of $\mu$ that minimizes the sum of
squared errors

$$S(\mu) = \sum_{i=1}^{n} (y_i - \mu)^2.$$

Taking the first derivative of this expression with respect to $\mu$ gives the first
order condition $\sum_{i=1}^{n} (y_i - \hat{\mu}) = 0$. Solving this for $\hat{\mu}$ gives the least squares
estimate $\hat{\mu} = \bar{y}$, the sample mean. Instead of least squares one could also use
other estimation criteria, for instance the sum of absolute errors

$$\sum_{i=1}^{n} |y_i - \mu|.$$

As will be seen in Chapter 5 (see Exercise 5.14), the resulting estimate is then
given by the median of the sample.
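
The two criteria can be compared numerically. The sketch below assumes NumPy and a hypothetical skewed sample (so that mean and median differ); it minimizes each criterion by brute force over a grid of candidate values for $\mu$ rather than analytically.

```python
import numpy as np

rng = np.random.default_rng(0)              # arbitrary seed
y = rng.exponential(scale=1.0, size=101)    # hypothetical skewed sample, odd n

# Candidate values for mu on a fine grid spanning the data.
grid = np.linspace(y.min(), y.max(), 20001)
residuals = y[None, :] - grid[:, None]      # one row of residuals per candidate

# Least squares: minimize the sum of squared errors.
ls_estimate = grid[np.argmin((residuals**2).sum(axis=1))]
# Alternative criterion: minimize the sum of absolute errors.
lad_estimate = grid[np.argmin(np.abs(residuals).sum(axis=1))]

print("least squares:", ls_estimate, " sample mean:  ", y.mean())
print("least absolute:", lad_estimate, " sample median:", np.median(y))
```

Up to the grid resolution, the first minimizer coincides with the sample mean and the second with the sample median.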

Maximum  likelihood 

A third method is that of maximum likelihood. Recall that a model consists
of a set $\{f_\theta;\ \theta \in \Theta\}$ of joint probability distributions for $y_1, \ldots, y_n$. For every
value of $\theta$, the distribution gives a certain value $f_\theta(y_1, \ldots, y_n)$ for the given
observations. When seen as a function of $\theta$, this is called the likelihood
function, denoted by $L(\theta)$, so that

$$L(\theta) = f_\theta(y_1, \ldots, y_n), \qquad \theta \in \Theta. \tag{1.39}$$

For discrete distributions, the likelihood $L(\theta)$ is equal to the probability (with
respect to the distribution $f_\theta$) of the actually observed outcome. The maximum
likelihood estimate is the value of $\theta$ for which this probability is
maximal (over the set of all possible values $\theta \in \Theta$). Similarly, for a continuous
distribution the maximum likelihood estimate is obtained by maximizing
$L(\theta)$ over $\Theta$.

An attractive property of this method is that the estimates are invariant
with respect to changes in the definition of the parameters. Suppose that,
instead of using the parameters $\theta$, one describes the model in terms of another
set of parameters $\psi$ and that the relation between $\psi$ and $\theta$ is given by
$\psi = h(\theta)$, where $h$ is an invertible transformation. The model is then
expressed as the set of distributions $\{\tilde{f}_\psi;\ \psi \in \Psi\}$ where $\tilde{f}_\psi = f_{h^{-1}(\psi)}$ and
$\Psi = h(\Theta)$. Let $\hat{\theta}$ and $\hat{\psi}$ be the maximum likelihood estimates of $\theta$ and $\psi$
respectively. Then $\hat{\theta} = h^{-1}(\hat{\psi})$, so that both models lead to the same estimated
probability distribution (see Exercise 1.6 for an example).

Comparison of methods

In  later  chapters  we  will  encounter  each  of  the  above  three  estimation 
methods.  It  depends  on  the  application  which  method  is  the  most  attractive 
one.  If  the  model  is  expressed  in  terms  of  an  equation,  as  in  (1.38),  then  least 
squares  is  intuitively  appealing,  as  it  optimizes  the  fit  of  the  model  with 
respect  to  the  observations.  Sometimes  the  model  is  expressed  in  terms  of 
moment  conditions,  so  that  the  method  of  moments  is  a  natural  way  of 
estimation. 

Least squares and the method of moments are both based on the idea of
minimizing a distance function. For least squares the distance is measured
directly in terms of the observed data, whereas for the method of moments the
distance is measured in terms of the sample and population moments. The
maximum likelihood method, on the other hand, is based not on a distance
function, but on the likelihood function that expresses the likelihood or
‘credibility’ of parameter values with respect to the observed data. This
method can be applied only if the joint probability distribution of the
observations is completely specified so that the likelihood function (1.39) is a
known function of $\theta$. In this case maximum likelihood estimators have
optimal properties in large samples, as will be discussed in Section 1.3.3.


Example  1.8:  Normal  Random  Sample 

We will illustrate the method of maximum likelihood by considering data
generated by a random sample from a normal distribution. Suppose that
$y_i \sim \mathrm{NID}(\mu, \sigma^2)$, $i = 1, \ldots, n$, with unknown parameters $\theta = (\mu, \sigma^2)$. Then
the likelihood function is given by

$$L(\mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2\sigma^2}(y_i - \mu)^2}.$$

As the logarithm is a monotonically increasing function, the likelihood function
and its logarithm $\log(L(\mu, \sigma^2))$ obtain their maximum for the same
values of $\mu$ and $\sigma^2$. As $\log(L(\mu, \sigma^2))$ is easier to work with, we maximize

$$\log(L(\mu, \sigma^2)) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mu)^2. \tag{1.40}$$

The first order conditions (with respect to $\mu$ and $\sigma^2$) for a maximum are given
by

$$\frac{\partial \log(L)}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(y_i - \mu) = 0, \tag{1.41}$$

$$\frac{\partial \log(L)}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(y_i - \mu)^2 = 0. \tag{1.42}$$

The solutions of these two equations are given by

$$\hat{\mu}_{ML} = \frac{1}{n}\sum_{i=1}^{n} y_i = \bar{y}, \qquad \hat{\sigma}^2_{ML} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2 = \frac{n-1}{n}\,s^2.$$

So $\mu$ is estimated by the sample mean and $\sigma^2$ by $(1 - \frac{1}{n})$ times the sample
variance. For large sample sizes the difference with the sample variance $s^2$
becomes negligible.

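To check whether the estimated values indeed correspond to a maximum
of the likelihood function we compute the matrix of second order derivatives
(the Hessian matrix) and check whether this matrix (evaluated at
$\hat{\mu}_{ML}$ and $\hat{\sigma}^2_{ML}$) is negative definite. By differentiating the above two first
order conditions, it follows that the Hessian matrix is equal to

$$H(\theta) = \begin{pmatrix} \dfrac{\partial^2 \log(L)}{\partial \mu^2} & \dfrac{\partial^2 \log(L)}{\partial \mu\, \partial \sigma^2} \\[2ex] \dfrac{\partial^2 \log(L)}{\partial \sigma^2\, \partial \mu} & \dfrac{\partial^2 \log(L)}{\partial (\sigma^2)^2} \end{pmatrix} = \begin{pmatrix} -\dfrac{n}{\sigma^2} & -\dfrac{1}{\sigma^4}\sum_{i=1}^{n}(y_i - \mu) \\[2ex] -\dfrac{1}{\sigma^4}\sum_{i=1}^{n}(y_i - \mu) & \dfrac{n}{2\sigma^4} - \dfrac{1}{\sigma^6}\sum_{i=1}^{n}(y_i - \mu)^2 \end{pmatrix}. \tag{1.43}$$

Evaluating this at the values of $\hat{\mu}_{ML}$ and $\hat{\sigma}^2_{ML}$ shows that $H(\hat{\mu}, \hat{\sigma}^2)$ is a
diagonal matrix with elements $-n/\hat{\sigma}^2_{ML}$ and $-n/(2\hat{\sigma}^4_{ML})$ on the diagonal,
which is indeed a negative definite matrix.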

Note that we expressed the model and the likelihood function in terms
of the parameters $\mu$ and $\sigma^2$. We could equally well use the parameters $\mu$
and $\sigma$. We leave it as an exercise (see Exercise 1.6) to show that solving the
first order conditions with respect to $\mu$ and $\sigma$ gives the same estimators as
before, which illustrates the invariance property of maximum likelihood
estimators.
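
The steps of this example can be checked numerically. The sketch below assumes NumPy and a hypothetical simulated sample (the mean 2.0 and standard deviation 1.5 are arbitrary choices, not values from the text); it evaluates the first order conditions (1.41) and (1.42) and the Hessian of the normal log-likelihood at the ML estimates.

```python
import numpy as np

rng = np.random.default_rng(1)                   # arbitrary seed
y = rng.normal(loc=2.0, scale=1.5, size=200)     # hypothetical sample
n = y.size

# Closed-form maximum likelihood estimates.
mu_ml = y.mean()
sigma2_ml = np.mean((y - mu_ml) ** 2)            # equals (n - 1) s^2 / n

def score(mu, sigma2):
    """First order derivatives of the log-likelihood, as in (1.41)-(1.42)."""
    d_mu = np.sum(y - mu) / sigma2
    d_sigma2 = -n / (2 * sigma2) + np.sum((y - mu) ** 2) / (2 * sigma2**2)
    return np.array([d_mu, d_sigma2])

def hessian(mu, sigma2):
    """Second order derivatives of the log-likelihood with respect to (mu, sigma2)."""
    h_mm = -n / sigma2
    h_ms = -np.sum(y - mu) / sigma2**2
    h_ss = n / (2 * sigma2**2) - np.sum((y - mu) ** 2) / sigma2**3
    return np.array([[h_mm, h_ms], [h_ms, h_ss]])

H = hessian(mu_ml, sigma2_ml)
print("score at ML estimates:", score(mu_ml, sigma2_ml))   # numerically zero
print("Hessian eigenvalues:  ", np.linalg.eigvalsh(H))     # both negative
```

Negative eigenvalues of the Hessian confirm that the stationary point is a maximum, in line with the diagonal elements derived above.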


Exercises:  T:  1.6a,  b,  1.9d,  1.10a,  c. 


1.3.2  Statistical  properties 

First used in Section 2.2.4; uses Appendix A.2-A.5.


Data  generating  process 

To evaluate the quality of estimators, suppose that the data are generated by
a particular distribution that belongs to the specified model. That is, the data
generating process (DGP) of $y_1, \ldots, y_n$ has a distribution $f_{\theta_0}$ where $\theta_0 \in \Theta$.
An estimator $\hat{\theta}$ is a function of the random variables $y_1, \ldots, y_n$, so that $\hat{\theta}$ is
itself a random variable with a distribution that depends on $\theta_0$. The estimator
would be perfect if $P[\hat{\theta} = \theta_0] = 1$, as it would always infer the correct
parameter value from the sample. However, as the observations are partly
random, $\theta_0$ can in general not be inferred with certainty from the information
in the data. To evaluate the quality of an estimator we therefore need
statistical measures for the distance of the distribution of $\hat{\theta}$ from $\theta_0$.

Variance  and  bias 

First assume that $\theta$ consists of a single parameter. The mean squared error
(MSE) of an estimator is defined by $E[(\hat{\theta} - \theta_0)^2]$, which can be decomposed in
two terms as

$$\mathrm{MSE}(\hat{\theta}) = E[(\hat{\theta} - \theta_0)^2] = \mathrm{var}(\hat{\theta}) + (E[\hat{\theta}] - \theta_0)^2. \tag{1.44}$$

Here all expectations are taken with respect to the underlying distribution $f_{\theta_0}$
of the data generating process. The first term is the variance of the estimator,
and if this is small this means that the estimator is not so much affected by the
randomness in the data. The second term is the square of the bias $E[\hat{\theta}] - \theta_0$,
and if this is small this means that the estimator has a distribution that is
centred around $\theta_0$. The mean squared error provides a trade-off between the
variance and the bias of an estimator.
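
The decomposition (1.44) follows by adding and subtracting $E[\hat{\theta}]$ inside the square. A short derivation, written out here as a supplementary step in the notation of this section:

```latex
\begin{aligned}
E[(\hat{\theta} - \theta_0)^2]
  &= E\big[(\hat{\theta} - E[\hat{\theta}] + E[\hat{\theta}] - \theta_0)^2\big] \\
  &= E\big[(\hat{\theta} - E[\hat{\theta}])^2\big]
     + 2\,(E[\hat{\theta}] - \theta_0)\,E\big[\hat{\theta} - E[\hat{\theta}]\big]
     + (E[\hat{\theta}] - \theta_0)^2 \\
  &= \mathrm{var}(\hat{\theta}) + (E[\hat{\theta}] - \theta_0)^2 .
\end{aligned}
```

The cross term vanishes because $E[\hat{\theta}] - \theta_0$ is a constant and $E[\hat{\theta} - E[\hat{\theta}]] = 0$.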

Unbiased  and  efficient  estimators 

The practical use of the MSE criterion is limited by the fact that $\mathrm{MSE}(\hat{\theta})$
depends in general on the value of $\theta_0$. As $\theta_0$ is unknown (else there would be
no reason to estimate it), one often uses other criteria that can be evaluated
without knowing $\theta_0$. For instance, one can restrict the attention to unbiased
estimators, that is, with the property that

$$E[\hat{\theta}] = \theta_0,$$

and try to minimize the variance $\mathrm{var}(\hat{\theta})$ within the class of unbiased estimators.
Assume again that $\theta$ consists of a single parameter. An estimator that
minimizes the variance over a class of estimators is called efficient within that
class. The Cramér-Rao lower bound states that for every unbiased estimator
$\hat{\theta}$ there holds

$$\mathrm{var}(\hat{\theta}) \ge \left( E\left[ \left( \frac{d \log(L(\theta))}{d\theta} \right)^2 \right] \right)^{-1} = \left( -E\left[ \frac{d^2 \log(L(\theta))}{d\theta^2} \right] \right)^{-1}, \tag{1.45}$$

where $L(\theta)$ is the likelihood function and, as before, the expectations are
taken with respect to the distribution with parameter $\theta_0$. The proof of the
equality in (1.45) is left as an exercise (see Exercise 1.7). The inequality in
(1.45) implies that a sufficient condition for the efficiency of an estimator $\hat{\theta}$ in
the class of unbiased estimators with $E[\hat{\theta}] = \theta_0$ is that $\mathrm{var}(\hat{\theta})$ is equal to the
Cramér-Rao lower bound. This condition is not necessary, however, because
in some situations the lower bound on the variance cannot be attained by any
unbiased estimator.


Warning  on  terminology 

A comment on the terminology is in order. Although the property of
unbiasedness is an attractive one, this does not mean that biased estimators should
automatically be discarded. Exhibit 1.11 shows the density functions of two
estimators, one that is unbiased but that has a relatively large variance and
another that has a small bias and a relatively small variance. In practice we
have a single sample $y_1, \ldots, y_n$ at our disposal, and corresponding single
outcomes of the estimators. As is clear from Exhibit 1.11, the outcome of the
biased estimator will in general be closer to the correct parameter value than
the outcome of the unbiased estimator. This shows that unbiasedness should
not be imposed blindly.


More  than  one  parameter 

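Now suppose that $\theta$ consists of a vector of parameters, and that $\hat{\theta}$ is a vector
of estimators where each component is an estimator of the corresponding
component of $\theta$. Then $\hat{\theta}$ is unbiased if $E[\hat{\theta}] = \theta_0$, that is, if all components
are unbiased. For unbiased estimators, the covariance matrix is given by

$$\mathrm{var}(\hat{\theta}) = E\big[(\hat{\theta} - E[\hat{\theta}])(\hat{\theta} - E[\hat{\theta}])'\big] = E\big[(\hat{\theta} - \theta_0)(\hat{\theta} - \theta_0)'\big].$$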

Exhibit 1.11 Bias and variance

Densities of two estimators, one that is unbiased but that has a larger variance and another one
that is biased (downwards) but that has a smaller variance ($\theta_0$ denotes the parameter of the
data generating process).

An estimator $\hat{\theta}_1$ is called more efficient than another estimator $\hat{\theta}_2$ if
$\mathrm{var}(\hat{\theta}_2) - \mathrm{var}(\hat{\theta}_1)$ is a positive semidefinite matrix. This means in particular
that every component of $\hat{\theta}_2$ has a variance that is at least as large as that of
the corresponding component of $\hat{\theta}_1$. The Cramér-Rao lower bound for the
variance of unbiased estimators is given by the inverse of the so-called
information matrix. This matrix is defined as follows, where the expectations are
taken with respect to the probability distribution with parameters $\theta_0$ and
where the derivatives are evaluated at $\theta_0$.

$$\mathcal{I}_0 = E\left[ \frac{\partial \log(L(\theta))}{\partial \theta}\, \frac{\partial \log(L(\theta))}{\partial \theta'} \right] = -E\left[ \frac{\partial^2 \log(L(\theta))}{\partial \theta\, \partial \theta'} \right]. \tag{1.46}$$

So, for every unbiased estimator $\hat{\theta}$ there holds that $\mathrm{var}(\hat{\theta}) - \mathcal{I}_0^{-1}$ is positive
semidefinite. A sufficient condition for efficiency of an unbiased estimator of
the $k$th component of $\theta$ is that its variance is equal to the $k$th diagonal
element of $\mathcal{I}_0^{-1}$.


Example  1.9:  Normal  Random  Sample  (continued) 

As in Example 1.8, we consider the case of data consisting of a random
sample from the normal distribution. We suppose that $y_i \sim \mathrm{NID}(\mu, \sigma^2)$,
$i = 1, \ldots, n$, with unknown parameters $\theta = (\mu, \sigma^2)$. The maximum likelihood
estimators are given by $\hat{\mu}_{ML} = \bar{y}$ and $\hat{\sigma}^2_{ML} = (n-1)s^2/n$ where $s^2$ is
the sample variance. We will investigate (i) the unbiasedness of the ML
estimators $\hat{\mu}_{ML}$ and $\hat{\sigma}^2_{ML}$, (ii) the variance and efficiency of these two estimators,
(iii) simulated sample distributions of these two estimators and of
two alternative estimators, the median (for $\mu$) and the sample variance $s^2$
(for $\sigma^2$), and (iv) the interpretation of the outcomes of this simulation
experiment.

(i) Means of the ML estimators $\hat{\mu}_{ML}$ and $\hat{\sigma}^2_{ML}$

As was shown in Section 1.2.4, $\hat{\mu}_{ML} = \bar{y} \sim N(\mu, \sigma^2/n)$ and $(n-1)s^2/\sigma^2 \sim \chi^2(n-1)$.
As the $\chi^2(n-1)$ distribution has mean $n-1$, it follows that $E[\hat{\mu}_{ML}] = \mu$ and that
$E[\hat{\sigma}^2_{ML}] = (n-1)\sigma^2/n$, that is, $\hat{\mu}_{ML}$ is unbiased but $\hat{\sigma}^2_{ML}$ is not. An unbiased
estimator of $\sigma^2$ is given by the
sample variance $s^2$. This is the reason to divide by $(n-1)$ in (1.3) instead
of by $n$. Unless the sample size $n$ is small, the difference between $s^2$ and $\hat{\sigma}^2_{ML}$
is small.

(ii)  Variance  and  efficiency  of  the  ML  estimators 

Now we evaluate the efficiency of the estimators $\bar{y}$ and $s^2$ in the class of all
unbiased estimators. The variance of $\bar{y}$ is equal to $\sigma^2/n$. As the $\chi^2(n-1)$
distribution has variance $2(n-1)$, it follows that $s^2$ has variance
$2\sigma^4/(n-1)$. The information matrix is equal to $\mathcal{I}_0 = -E[H(\theta_0)]$, where
$\theta_0 = (\mu, \sigma^2)$ and $H(\theta_0)$ is the Hessian matrix in (1.43). As $E[y_i - \mu] = 0$
and $E[(y_i - \mu)^2] = \sigma^2$, it follows that

$$\mathcal{I}_0 = \begin{pmatrix} \dfrac{n}{\sigma^2} & 0 \\[1.5ex] 0 & \dfrac{n}{2\sigma^4} \end{pmatrix}.$$

By taking the inverse, it follows that the Cramér-Rao lower bounds for the
variance of unbiased estimators of $\mu$ and $\sigma^2$ are respectively $\sigma^2/n$ and $2\sigma^4/n$.
So $\bar{y}$ is efficient, but the variance of $s^2$ does not attain the lower bound. We
mention that $s^2$ is nonetheless efficient, that is, there exists no unbiased
estimator of $\sigma^2$ with variance smaller than $2\sigma^4/(n-1)$.

(iii)  Simulated  sample  distributions 

To illustrate the sampling aspect of estimators, we perform a small simulation
experiment. To perform a simulation we have to specify the data
generating process. We consider $n = 10$ independent random variables
$y_1, \ldots, y_n$ that are all normally distributed with mean $\mu = 0$ and variance
$\sigma^2 = 1$. A simulation run then consists of the outcomes of the variables
$y_1, \ldots, y_{10}$ obtained by ten random drawings from $N(0, 1)$. Statistical and
econometric software packages contain random number generators for
this purpose. For such a simulated set of ten data points, we compute
the following statistics: the sample mean $\bar{y}$, the sample median $\mathrm{med}(y)$, the
sample variance $s^2$, and the second sample moment $m_2 = \hat{\sigma}^2_{ML}$. The values
of these statistics depend on the simulated data, so that the outcomes will
be different for different simulation runs. To get an idea of this variation
we perform 10,000 runs. Exhibit 1.12 shows histograms for the resulting
10,000 outcomes of the statistics $\bar{y}$ in (a), $\mathrm{med}(y)$ in (b), $s^2$ in (c), and
$\hat{\sigma}^2_{ML}$ in (d), together with their averages and standard deviations over the
10,000 runs.

Exhibit 1.12 Normal Random Sample (Example 1.9)

Histograms of sample mean (a), sample median (b), sample variance (c), and second sample
moment (d) obtained in 10,000 simulation runs. Each simulation run consists of ten random
drawings from the standard normal distribution and provides one outcome of the four sample
statistics.

(iv) Interpretation of simulation outcomes

Both the sample mean and the median have an average close to the mean
$\mu = 0$ of the data generating process, but the sample mean has a smaller
standard deviation than the median. This is in line with the fact that the
sample mean is the efficient estimator. Also note that the sample standard
deviation of the sample mean over the 10,000 runs (0.3159, see (a)) is close
to the theoretical standard deviation of the sample mean (which
is $\sigma/\sqrt{n} = 1/\sqrt{10} = 0.3162$). The estimates $\hat{\sigma}^2_{ML}$ show a downward bias,
whereas $s^2$ has an average that is close to the variance $\sigma^2 = 1$ of the data
generating process. This is in line with the fact that $\hat{\sigma}^2_{ML}$ is biased and $s^2$
is unbiased. The theoretical expected value of $\hat{\sigma}^2_{ML}$ is equal to $\frac{n-1}{n}\sigma^2 = 0.9$,
which is close to the sample average of the estimates over the 10,000
simulation runs (0.901). The standard deviations of $\hat{\sigma}^2_{ML}$ and $s^2$ over the
10,000 runs are in line with the theoretical standard deviations of
$\sqrt{2(n-1)/n^2} = 0.424$ for $\hat{\sigma}^2_{ML}$ (as compared to a value of 0.418 in (d)
over the 10,000 simulation runs) and $\sqrt{2/(n-1)} = 0.471$ for $s^2$ (as compared
to a value of 0.465 in (c) over the 10,000 simulation runs).
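
A simulation of this kind takes only a few lines. The sketch below assumes NumPy; the seed is arbitrary, so the averages and standard deviations will differ slightly from the values reported in Exhibit 1.12.

```python
import numpy as np

rng = np.random.default_rng(12345)     # arbitrary seed, for reproducibility
n, runs = 10, 10_000

# Each row is one simulation run of n drawings from N(0, 1).
samples = rng.standard_normal((runs, n))

stats = {
    "mean": samples.mean(axis=1),
    "median": np.median(samples, axis=1),
    "s2": samples.var(axis=1, ddof=1),          # sample variance (divide by n - 1)
    "sigma2_ml": samples.var(axis=1, ddof=0),   # second sample moment (divide by n)
}

for name, values in stats.items():
    print(f"{name:10s} average {values.mean():7.4f}   std. dev. {values.std(ddof=1):6.4f}")

# Theoretical standard deviations quoted in the text, for comparison.
print("sd of mean     :", 1 / np.sqrt(n))
print("sd of s2       :", np.sqrt(2 / (n - 1)))
print("sd of sigma2_ml:", np.sqrt(2 * (n - 1) / n**2))
```

The same pattern (draw many samples, compute each statistic per sample, summarize across runs) applies to any estimator whose sampling distribution is to be explored.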


Exercises: T: 1.6c, 1.7a, 1.8a-c, 1.9a-c, e, 1.10a.


1.3.3  Asymptotic  properties 

First used in Section 4.1; uses Appendix A.2-A.5.


Motivation 

In some situations the sample distribution of an estimator is known exactly.
For instance, for random samples from the normal distribution, the sample
mean and variance have distributions given by (1.33) and (1.35). In other
cases, however, the exact finite sample distribution of estimators is not
known. This is the case for many estimators used in econometrics, as will
become clear in later chapters. Basically two methods can be followed in
such situations. One method is to simulate the distribution of the estimator
for a range of data generating processes. A possible disadvantage is that the
results depend on the chosen parameters of the data generating process.
Another method is to consider the asymptotic properties of the estimator,
that is, the properties if the sample size $n$ tends to infinity. Asymptotic
properties give an indication of the distribution of the estimator in large
enough finite samples. A possible disadvantage is that it may be less clear
whether the actual sample size is large enough to use the asymptotic
properties as an approximation.

Consistency 

In this section we discuss some asymptotic properties that are much used in
econometrics. Let $\theta$ be a parameter of interest and let $\hat{\theta}_n$ be an estimator of $\theta$
that is based on a sample of $n$ observations. We are interested in the
properties of this estimator when $n \to \infty$, under the assumption that the data are
generated by a process with parameter $\theta_0$. The estimator is called consistent if
it converges in probability to $\theta_0$, that is, if for all $\delta > 0$ there holds

$$\lim_{n \to \infty} P\big[\,|\hat{\theta}_n - \theta_0| < \delta\,\big] = 1. \tag{1.47}$$

In this case $\theta_0$ is called the probability limit of $\hat{\theta}_n$, written as $\mathrm{plim}(\hat{\theta}_n) = \theta_0$. If
$\theta$ is a vector of parameters, an estimator $\hat{\theta}_n$ is called consistent if each
component of $\hat{\theta}_n$ is a consistent estimator of the corresponding component
of $\theta$. Consistency is illustrated graphically in Exhibit 1.13. The distribution
of the estimator becomes more and more concentrated around the correct
parameter value $\theta_0$ if the sample size increases.

Exhibit 1.13 Consistency

Distribution of a consistent estimator for three sample sizes, with $n_1 < n_2 < n_3$. If the sample
size gets larger, then the distribution becomes more concentrated around the parameter $\theta_0$
of the data generating process.

A sufficient (but not necessary) condition for consistency is that

$$\lim_{n \to \infty} E[\hat{\theta}_n] = \theta_0 \quad \text{and} \quad \lim_{n \to \infty} \mathrm{var}(\hat{\theta}_n) = 0, \tag{1.48}$$

that is, if the estimator is asymptotically unbiased and its variance tends to
zero (see Exercise 1.7).
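
One way to see why asymptotic unbiasedness together with a vanishing variance suffices is the following sketch, which is not the route prescribed by the exercise: it combines Markov's inequality applied to $(\hat{\theta}_n - \theta_0)^2$ with the decomposition (1.44).

```latex
P\big[\,|\hat{\theta}_n - \theta_0| \ge \delta\,\big]
  \;\le\; \frac{E\big[(\hat{\theta}_n - \theta_0)^2\big]}{\delta^2}
  \;=\; \frac{\mathrm{var}(\hat{\theta}_n) + \big(E[\hat{\theta}_n] - \theta_0\big)^2}{\delta^2}
  \;\longrightarrow\; 0 \qquad (n \to \infty).
```

Both terms in the numerator tend to zero under the stated condition, so the probability in (1.47) tends to one for every $\delta > 0$.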

Calculation  rules  for  probability  limits 

Probability limits are easy to work with, as they have properties similar to those of
ordinary limits of functions. Suppose that $y_n$ and $z_n$ are two sequences of
random variables with probability limits $\mathrm{plim}(y_n) = c_1$ and $\mathrm{plim}(z_n) =
c_2\ (\ne 0)$, then there holds, for instance, that

$$\mathrm{plim}(y_n + z_n) = c_1 + c_2, \qquad \mathrm{plim}(y_n z_n) = c_1 c_2, \qquad \mathrm{plim}(y_n / z_n) = c_1 / c_2$$

(see Exercise 1.7). Note that for expectations there holds $E[y + z] =
E[y] + E[z]$, but $E[yz] = E[y]E[z]$ holds only if $y$ and $z$ are uncorrelated,
and in general $E[y/z] \ne E[y]/E[z]$ (even when $y$ and $z$ are independent). If
$g$ is a continuous function that does not depend on $n$, then

$$\mathrm{plim}(g(y_n)) = g(c_1)$$

(see Exercise 1.7), so that, for instance, $\mathrm{plim}(y_n^2) = c_1^2$. Again, for
expectations this does in general not hold true (unless $g$ is linear);
for instance, $E[y_n^2] \ne (E[y_n])^2$ in general. This result implies that, if
$\hat{\theta}_n$ is a consistent estimator of $\theta_0$, then $g(\hat{\theta}_n)$ is a consistent estimator of
$g(\theta_0)$.

Similar results hold true for vector or matrix sequences of random
variables. Let $A_n$ be a sequence of $p \times q$ matrices of random variables
$a_n(i, j)$; then we write $\mathrm{plim}(A_n) = A$ if all the elements converge so
that $\mathrm{plim}(a_n(i, j)) = a(i, j)$ for all $i = 1, \ldots, p$, $j = 1, \ldots, q$. For two matrix
sequences $A_n$ and $B_n$ with $\mathrm{plim}(A_n) = A$ and $\mathrm{plim}(B_n) = B$ there holds

$$\mathrm{plim}(A_n + B_n) = A + B, \qquad \mathrm{plim}(A_n B_n) = AB, \qquad \mathrm{plim}(A_n^{-1} B_n) = A^{-1}B,$$

provided that the matrices have compatible dimensions and, for the last
equality, that the matrix $A$ is invertible.
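
The contrast between probability limits and expectations can be made concrete by simulation. The sketch below assumes NumPy and two hypothetical distributions chosen only for illustration: $y$ exponential with mean $c_1 = 2$ and, independently, $z$ uniform on $(0.5, 1.5)$ with mean $c_2 = 1$, for which $E[1/z] = \log 3$.

```python
import numpy as np

rng = np.random.default_rng(7)     # arbitrary seed
c1, c2 = 2.0, 1.0                  # population means, so c1 / c2 = 2

def draw(n):
    """Draw n independent observations of y and z (illustrative distributions)."""
    y = rng.exponential(scale=c1, size=n)    # E[y] = c1
    z = rng.uniform(0.5, 1.5, size=n)        # E[z] = c2, bounded away from zero
    return y, z

# Ratio of sample means: its probability limit is c1 / c2.
ratio_of_means = {}
for n in (10, 1_000, 100_000):
    y, z = draw(n)
    ratio_of_means[n] = y.mean() / z.mean()
    print(f"n = {n:6d}   ratio of sample means = {ratio_of_means[n]:.4f}")

# Average of ratios: estimates E[y/z] = c1 * log(3), not E[y] / E[z].
y, z = draw(1_000_000)
mean_of_ratios = np.mean(y / z)
print("average of y/z:", mean_of_ratios, "  c1 * log(3):", c1 * np.log(3))
```

The ratio of sample means settles near $c_1/c_2 = 2$ as $n$ grows, whereas the average of the individual ratios settles near $2\log 3 \approx 2.20$.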

Law  of  large  numbers 

When the data consist of a random sample from a population, sample
moments provide consistent estimators of the population moments. The
reason is that the uncertainty in the individual observations cancels out in
the limit by taking averages. That is, if $y_i \sim \mathrm{IID}$, $i = 1, \ldots, n$, with finite
population mean $E[y_i] = \mu$, then

$$\mathrm{plim}\left( \frac{1}{n} \sum_{i=1}^{n} y_i \right) = \mu. \tag{1.49}$$

This is called the law of large numbers. To get the idea, assume for simplicity
that the population variance $\sigma^2$ is finite. Then the sample mean of $n$ observations
$\bar{y}_n = \frac{1}{n} \sum_{i=1}^{n} y_i$ is a random variable with mean $\mu$ and variance $\sigma^2/n$, and
(1.49) follows from (1.48). Similarly, if $y_i \sim \mathrm{IID}$, $i = 1, \ldots, n$, and the $r$th
population moment $\mu_r = E[(y_i - \mu)^r] < \infty$, then

$$\mathrm{plim}\left( \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y}_n)^r \right) = \mu_r.$$

For instance, the sample variance converges in probability to the population
variance. Also the sample covariance between two variables converges in
probability to the population covariance, and so on.

Central  limit  theorem 

A sequence of random variables $y_n$ with cumulative distribution functions $F_n$
is said to converge in distribution to a random variable $y$ with distribution
function $F$ if $\lim_{n \to \infty} F_n(v) = F(v)$ at all points $v$ where $F$ is continuous. This is
written as

$$y_n \xrightarrow{d} y$$

and $F$ is also called the asymptotic distribution of $y_n$. A central result in
statistics is that, under very general conditions, sample averages from arbitrary
distributions are asymptotically normally distributed. Let
$y_i$, $i = 1, \ldots, n$ be independently and identically distributed random variables
with mean $\mu$ and finite variance $\sigma^2$. Then

$$z_n = \sqrt{n}\, \frac{\bar{y}_n - \mu}{\sigma} \xrightarrow{d} z \sim N(0, 1). \tag{1.50}$$

This is called the central limit theorem. This means that (after standardization
by subtracting the mean, dividing by the standard deviation, and
multiplying by the square root of the sample size) the sample mean of a
random sample from an arbitrary distribution has an asymptotic standard
normal distribution. For large enough sample sizes, the finite sample distribution
of $z_n$ can be approximated by the standard normal distribution
$N(0, 1)$. It follows that $\bar{y}_n$ is approximately distributed as $N(\mu, \frac{\sigma^2}{n})$, which
we write as

$$\bar{y}_n \approx N\left(\mu, \frac{\sigma^2}{n}\right).$$

Note that an exact distribution is denoted by $\sim$ and an approximate distribution
by $\approx$.
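
The central limit theorem can be illustrated by simulation. The sketch below assumes NumPy and uses the exponential distribution with mean 1 (and hence standard deviation 1) as a hypothetical example of a clearly non-normal population; it standardizes the sample mean as in (1.50) for several sample sizes.

```python
import numpy as np

rng = np.random.default_rng(2024)    # arbitrary seed
mu, sigma = 1.0, 1.0                 # mean and std. dev. of the exponential(1) distribution
runs = 10_000

def skewness(x):
    """Sample skewness: third central moment over the 1.5th power of the second."""
    d = x - x.mean()
    return np.mean(d**3) / np.mean(d**2) ** 1.5

results = {}
for n in (5, 50, 500):
    samples = rng.exponential(scale=1.0, size=(runs, n))
    z = np.sqrt(n) * (samples.mean(axis=1) - mu) / sigma    # z_n as in (1.50)
    coverage = np.mean(np.abs(z) <= 1.96)                   # 0.95 under N(0, 1)
    results[n] = {"mean": z.mean(), "std": z.std(), "skew": skewness(z), "coverage": coverage}
    print(n, {k: round(v, 3) for k, v in results[n].items()})
```

The mean and standard deviation of $z_n$ are close to 0 and 1 for every $n$, while the skewness inherited from the exponential distribution fades as $n$ grows, which is the sense in which the normal approximation improves.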

Generalized  central  limit  theorems 

The above central limit theorem for the IID case can be generalized in
several directions. We mention three generalizations that are used later
in this book. When $y_i$ are independent random variables with common
mean $\mu$ and different variances $\sigma_i^2$ for which the average variance
$\sigma^2 = \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \sigma_i^2$ is finite, then

$$\sqrt{n}\,(\bar{y}_n - \mu) \xrightarrow{d} N(0, \sigma^2).$$

When $y_i$, $i = 1, \ldots, n$ is a random sample from a $p$-dimensional distribution
with finite vector of means $\mu$ and finite covariance matrix $\Sigma$, then

$$\sqrt{n}\left( \frac{1}{n} \sum_{i=1}^{n} y_i - \mu \right) \xrightarrow{d} N(0, \Sigma).$$

Now suppose that $A_n$ is a sequence of $p \times p$ matrices of random variables
and that $y_n$ is a sequence of $p \times 1$ vectors of random variables. When
$\mathrm{plim}(A_n) = A$ where $A$ is a given (non-random) matrix and $y_n \xrightarrow{d} N(0, \Sigma)$,
then

$$A_n y_n \xrightarrow{d} N(0, A \Sigma A').$$

Asymptotic  properties  of  maximum  likelihood  estimators 

The law of large numbers shows that moment estimators are consistent in
case of random samples. However, maximum likelihood estimators are
consistent and (asymptotically) efficient. Suppose that the likelihood
function (1.39) is correctly specified, in the sense that the data are generated
by a distribution $f_{\theta_0}$ with $\theta_0 \in \Theta$. Then, under certain regularity conditions,
maximum likelihood estimators are consistent, asymptotically efficient, and
asymptotically normally distributed. More in particular, under these conditions
there holds

$$\sqrt{n}\,(\hat{\theta}_{ML} - \theta_0) \xrightarrow{d} N(0, \mathcal{I}_0^{-1}). \tag{1.51}$$

Here $\hat{\theta}_{ML}$ denotes the maximum likelihood estimator (based on $n$ observations)
and $\mathcal{I}_0$ is the asymptotic information matrix evaluated at $\theta_0$,
that is,

$$\mathcal{I}_0 = \lim_{n \to \infty} \left( \frac{1}{n}\, \mathcal{I}_n \right),$$

where $\mathcal{I}_n$ is the information matrix for sample size $n$ defined in (1.46).
Asymptotic efficiency means that $\sqrt{n}\,(\hat{\theta}_{ML} - \theta_0)$ has, for $n \to \infty$, the smallest
covariance matrix among all consistent estimators, in the following sense.
Let $\hat{\theta}$ be a consistent estimator (based on $n$ observations) and let
$\Sigma = \lim_{n \to \infty} \mathrm{var}(\sqrt{n}\,(\hat{\theta} - \theta_0))$ and $\Sigma_{ML} = \lim_{n \to \infty} \mathrm{var}(\sqrt{n}\,(\hat{\theta}_{ML} - \theta_0))$, then
$\Sigma - \Sigma_{ML}$ is a positive semidefinite matrix. For finite samples we obtain
from (1.51) the approximation

$$\hat{\theta}_{ML} \approx N(\theta_0, \hat{\mathcal{I}}_n^{-1}),$$

where $\hat{\mathcal{I}}_n$ is the information matrix (1.46) evaluated at $\hat{\theta}_{ML}$. The result in
(1.51) can be seen as a generalization of the central limit theorem.
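
As a worked illustration of (1.51), consider the normal random sample of Examples 1.8 and 1.9, for which the Cramér-Rao lower bounds for $\mu$ and $\sigma^2$ are $\sigma^2/n$ and $2\sigma^4/n$. Writing $\mathcal{I}_n$ for the information matrix at sample size $n$, the quantities in (1.51) become:

```latex
\mathcal{I}_n = \begin{pmatrix} \dfrac{n}{\sigma^2} & 0 \\[1.5ex] 0 & \dfrac{n}{2\sigma^4} \end{pmatrix},
\qquad
\mathcal{I}_0 = \lim_{n \to \infty} \frac{1}{n}\,\mathcal{I}_n
             = \begin{pmatrix} \dfrac{1}{\sigma^2} & 0 \\[1.5ex] 0 & \dfrac{1}{2\sigma^4} \end{pmatrix},
\qquad
\sqrt{n} \begin{pmatrix} \hat{\mu}_{ML} - \mu \\ \hat{\sigma}^2_{ML} - \sigma^2 \end{pmatrix}
  \xrightarrow{d} N\!\left( 0,\ \begin{pmatrix} \sigma^2 & 0 \\ 0 & 2\sigma^4 \end{pmatrix} \right).
```

With $\sigma^2 = 1$, the asymptotic standard deviation of $\sqrt{n}\,(\hat{\sigma}^2_{ML} - 1)$ is therefore $\sqrt{2} \approx 1.414$.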


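Intuitive argument for the consistency of $\hat{\theta}_{ML}$

Although a formal proof of consistency falls outside the scope of this book, it may
be of interest to provide some intuition for this result, without being precise about
the required regularity conditions. Suppose that $y_i$, $i = 1, \ldots, n$ are IID with
common probability density function $f_{\theta_0}$. The ML estimator $\hat{\theta}_{ML}$ is obtained by
maximizing the likelihood function $L(\theta) = \prod_{i=1}^{n} f_\theta(y_i)$ or equivalently by
maximizing the log-likelihood $\frac{1}{n} \log(L(\theta)) = \frac{1}{n} \sum_{i=1}^{n} \log(f_\theta(y_i))$. The first order
conditions for a maximum of this function can be expressed as

$$\frac{\partial}{\partial \theta} \left( \frac{1}{n} \log(L(\theta)) \right) = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial \log(f_\theta(y_i))}{\partial \theta} = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{f_\theta(y_i)}\, \frac{\partial f_\theta(y_i)}{\partial \theta} = 0.$$

Under suitable regularity conditions, the law of large numbers applies to the IID
random variables $\frac{1}{f_\theta(y_i)} \frac{\partial f_\theta(y_i)}{\partial \theta}$, so that the first order conditions converge in
probability (for $n \to \infty$) to

$$\mathrm{plim}\left( \frac{1}{n} \sum_{i=1}^{n} \frac{1}{f_\theta(y_i)}\, \frac{\partial f_\theta(y_i)}{\partial \theta} \right) = E_0\left[ \frac{1}{f_\theta(y_i)}\, \frac{\partial f_\theta(y_i)}{\partial \theta} \right] = 0,$$

where $E_0$ means that the expectation should be evaluated according to the density
$f_{\theta_0}$ of the DGP. We will show below that the DGP parameter $\theta_0$ solves the above
asymptotic first order condition for a maximum. Intuitively, the estimator $\hat{\theta}_{ML}$
(which solves the equations for finite $n$) will then converge to $\theta_0$ (which solves the
equations asymptotically) in case $n \to \infty$. This intuition is correct under suitable
regularity conditions. Then $\mathrm{plim}(\hat{\theta}_{ML}) = \theta_0$, so that $\hat{\theta}_{ML}$ is consistent.

To prove that $\theta_0$ is a solution of the asymptotic first order conditions, note that
$f_\theta$ is a density function, so that $\int f_\theta(y_i)\, dy_i = 1$. Using this result, and substituting
$\theta = \theta_0$ in the asymptotic first order conditions, we get

$$E_0\left[ \frac{1}{f_{\theta_0}(y_i)}\, \frac{\partial f_\theta(y_i)}{\partial \theta} \right] = \int \frac{1}{f_{\theta_0}(y_i)}\, \frac{\partial f_\theta(y_i)}{\partial \theta}\, f_{\theta_0}(y_i)\, dy_i = \int \frac{\partial f_\theta(y_i)}{\partial \theta}\, dy_i = \frac{\partial}{\partial \theta} \left( \int f_\theta(y_i)\, dy_i \right) = \frac{\partial}{\partial \theta}(1) = 0,$$

where the derivatives are evaluated at $\theta = \theta_0$. This shows that $\theta_0$ solves the
asymptotic first order conditions.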


Example  1.10:  Simulated  Normal  Random  Sample 

To illustrate the consistency and asymptotic normality of maximum likelihood,
we consider the following simulation experiment. We generate a
sample of $n$ observations $(y_1, \cdots, y_n)$ by independent drawings from the
standard normal distribution N(0, 1) and we compute the corresponding
maximum likelihood estimate $\hat\sigma^2_{ML} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \bar y)^2$. The results in Section
1.3.2 and Example 1.9 (for n = 10) showed that $\hat\sigma^2_{ML}$ is a biased estimator of
$\sigma^2 = 1$. For each of the sample sizes n = 10, n = 100, and n = 1000, we
perform 10,000 simulation runs.

Exhibits 1.14 (a), (c), and (e) show three histograms of the resulting
10,000 estimates of $\sigma^2$ for the three sample sizes. As the histograms become
strongly concentrated around the value $\sigma^2 = 1$ for large sample size
(n = 1000), this illustrates the consistency of this estimator. Exhibits 1.14 (b),
(d), and (f) show three histograms of the resulting 10,000 values of
$\sqrt{n}(\hat\sigma^2_{ML} - 1)$ for the three sample sizes. Whereas for n = 10 the skewness
of the $\chi^2$ distribution is still visible, for n = 1000 the distribution is much
more symmetric and approaches a normal distribution. This illustrates the
asymptotic normality of the maximum likelihood estimator of $\sigma^2$.
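
The experiment of this example can be sketched in a few lines of Python. This is a minimal illustration and not the code behind Exhibit 1.14: it uses only 500 runs per sample size (instead of 10,000) to keep the running time short, the seed is arbitrary, and the function names `sigma2_ml` and `simulate` are our own.

```python
import random
import statistics

def sigma2_ml(sample):
    """ML estimate of the variance: mean squared deviation from the sample mean."""
    ybar = statistics.fmean(sample)
    return sum((y - ybar) ** 2 for y in sample) / len(sample)

def simulate(n, runs, rng):
    """Return `runs` ML variance estimates, each based on n draws from N(0, 1)."""
    return [sigma2_ml([rng.gauss(0.0, 1.0) for _ in range(n)])
            for _ in range(runs)]

rng = random.Random(12345)  # arbitrary seed, for reproducibility only
results = {}
for n in (10, 100, 1000):
    estimates = simulate(n, runs=500, rng=rng)
    # Store the average estimate and the spread of the estimates over the runs.
    results[n] = (statistics.fmean(estimates), statistics.pstdev(estimates))
    print(n, results[n])
```

The spread of the estimates should shrink as n grows (consistency), while the average for n = 10 should stay visibly below 1, reflecting the bias factor (n - 1)/n of the ML estimator.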


[Exhibit 1.14: histograms, panels (a)-(f). Summary statistics for the n = 1000 panels (10,000 observations each):]

              SIGMA2HAT1000     SER1000
Mean              0.998829     -0.037024
Median            0.997968     -0.064244
Maximum           1.200872      6.352126
Minimum           0.834406     -5.236548
Std. Dev.         0.044717      1.414064
Skewness          0.106363      0.106363
Kurtosis          3.063276      3.063276

Exhibit  1.14  Simulated  Normal  Random  Sample  (Example  1.10) 


Histograms of the maximum likelihood estimates of the error variance (denoted by
SIGMA2HAT, shown in (a), (c), and (e)) and of a scaled version (defined by $\sqrt{n}(\hat\sigma^2_{ML} - 1)$
and denoted by SER, shown in (b), (d), and (f)) for random drawings of the standard
normal distribution, with sample size n = 10 in (a)-(b), n = 100 in (c)-(d), and n = 1000
in (e)-(f).


Exercises: T: 1.7b-e, 1.8d, 1.10b-d, 1.12d.

1.4 Tests of hypotheses


1.4.1  Size  and  power 

First  used  in  Section  2.3.1. 


Null  hypothesis  and  alternative  hypothesis 

When observations are affected by random influences, the same holds true
for all inference that is based on these data. If one wishes to evaluate
hypotheses concerning the data generating process, one should take the
random nature of the data into account. For instance, consider the hypothesis
that the data are generated by a probability distribution with mean zero.
In general, the sample mean of the observed data will not be (exactly) equal
to zero, even if the hypothesis is correct. The question is whether the difference
between the sample mean and the hypothetical population mean is due
only to randomness in the data. If this seems unlikely, then the hypothesis is
possibly not correct.

Now we introduce some terminology. A statistical hypothesis is an assertion
about the distribution of one or more random variables. If the functional
form of the distribution is known up to a parameter (or vector of parameters),
then the hypothesis is parametric; otherwise it is non-parametric. If the
hypothesis specifies the distribution completely, then it is called a simple
hypothesis; otherwise, a composite hypothesis. We restrict attention to
parametric hypotheses where one assertion, called the null hypothesis and
denoted by $H_0$, is tested against another one, called the alternative hypothesis
and denoted by $H_1$. Let the specified set of distributions be given by
$\{f_\theta; \theta \in \Theta\}$; then $H_0$ corresponds to the assertion that $\theta \in \Theta_0$ and $H_1$ to the
assertion that $\theta \in \Theta_1$, where $\Theta_0$ and $\Theta_1$ are disjoint subsets of $\Theta$. For
instance, if $\theta$ is the (unknown) population mean, then we can test the
hypothesis of zero mean against the alternative of non-zero mean. In this
case $\Theta_0 = \{0\}$ and $\Theta_1 = \{\theta \in \Theta; \theta \neq 0\}$.

Test  statistic  and  critical  region 

Let $\theta_0$ be the parameter (or vector of parameters) of the data generating
process. The observed data $(y_1, \cdots, y_n)$ are used to decide which of the two
hypotheses ($H_0$ and $H_1$) seems to be the most appropriate one. This decision
is made by means of a test statistic $t(y_1, \cdots, y_n)$, that is, an expression that
can be computed from the observed data. The possible outcomes of this
statistic are divided into two regions, the critical region (denoted by $C$) and
the complement of this region. The null hypothesis is rejected if $t \in C$ and it is
not rejected if $t \notin C$. Note that we say that $H_0$ is not rejected, instead of saying
that $H_0$ is accepted. For instance, to test the null hypothesis that the population
mean $\theta = 0$ against the alternative that $\theta \neq 0$, the sample mean $\bar y$
provides an intuitively appealing test statistic. The null hypothesis will be
rejected if $\bar y$ is 'too far away' from zero. For instance, one can choose a value
$c > 0$ and reject the null hypothesis if $|\bar y| > c$. If the sample is such that
$|\bar y| \le c$, then the hypothesis is not rejected, but this does not mean that we
accept the null hypothesis as a factual truth.

Size  and  power 

In general, to test a hypothesis one has to decide about the test statistic and
about the critical region. These should be selected in such a way that one can
discriminate well between the null and alternative hypotheses. The quality of
a test can be evaluated in terms of the probability $\pi(\theta) = P[t \in C]$ to reject
the null hypothesis. We restrict attention to similar tests, that is, test
statistics $t(y_1, \cdots, y_n)$ that are pivotal in the sense that the distribution under
the null hypothesis does not depend on any unknown parameters. This
means that, for every given critical region $C$, the rejection probability $\pi(\theta)$
can be calculated for $\theta \in \Theta_0$ and that (in case the set $\Theta_0$ contains more than
one element) this probability is independent of the value of $\theta \in \Theta_0$. If the null
hypothesis is valid (so that the data generating process satisfies $\theta_0 \in \Theta_0$) but
the observed data lead to rejection of the null hypothesis (because $t \in C$), this
is called an error of the first type. The probability of this error is called the
size or also the significance level of the test. For similar tests, the size can be
computed for every given critical region. On the other hand, if the null
hypothesis is false (as $\theta_0 \notin \Theta_0$) but the observed data do not lead to rejection
of the null hypothesis (because $t \notin C$), this is called an error of the second
type. The rejection probability $\pi(\theta)$ for $\theta \notin \Theta_0$ is called the power of the test.
A test is called consistent if the power $\pi(\theta)$ converges to 1 for all $\theta \notin \Theta_0$ if
$n \to \infty$.
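
In simple cases the rejection probability $\pi(\theta)$ can be computed explicitly. The sketch below assumes a setting that goes beyond what is stated above: the data are $y_i \sim NID(\mu, 1)$ with known variance, so that $\bar y \sim N(\mu, 1/n)$, and the null hypothesis $\mu = 0$ is rejected if $|\bar y| > c$ with $c = 1.96/\sqrt{n}$. The function name `rejection_probability` and the chosen values of $\mu$ and n are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

def rejection_probability(mu, n, c):
    """P[|ybar| > c] when ybar ~ N(mu, 1/n), that is, for y_i ~ NID(mu, 1)."""
    ybar = NormalDist(mu, 1.0 / sqrt(n))
    return ybar.cdf(-c) + (1.0 - ybar.cdf(c))

# Size: rejection probability under the null hypothesis mu = 0.
size = rejection_probability(0.0, n=100, c=1.96 / sqrt(100))

# Power at the alternative mu = 0.2 for increasing sample sizes.
power_by_n = {n: rejection_probability(0.2, n, 1.96 / sqrt(n))
              for n in (10, 100, 1000)}

print(size)
print(power_by_n)
```

The size stays at about 5 per cent by construction of c, whereas the power at the fixed alternative increases towards 1 as n grows, which is what consistency of the test means.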

Tests  with  a  given  significance  level 

Of course, for practical applications one prefers tests that have small size and
large power in finite samples. A perfect test would be one where the probability
to make a mistake is zero, that is, with $\pi(\theta) = 0$ for $\theta \in \Theta_0$ and $\pi(\theta) = 1$
for $\theta \notin \Theta_0$. This is possible only if the parameter values can be inferred with
absolute certainty from the observed data, but then there is, of course, no need
for tests anymore. In practice one often fixes a maximally tolerated size, for
instance 5 per cent, to control for errors of the first type. It then remains to
choose a test statistic and a critical region with good power properties, that
is, with small probabilities of errors of the second type. In this book we mostly
use intuitive arguments to construct econometric tests of a given size, and we
pay relatively little attention to their power properties. However, many of the
standard econometric tests can be shown to have reasonably good power
properties.

Note that the null and alternative hypotheses play different roles in testing,
because a small size is taken as a starting point. This means that an
econometrician should try to formulate tests in such a way that errors of the first
type are more serious than errors of the second type.

The  meaning  of  significance 

One  should  distinguish  statistical  significance  from  practical  significance. 
In  empirical  econometrics  we  analyse  data  that  are  the  result  of  economic 
processes  that  are  often  relatively  involved.  The  purpose  of  an  econometric 
model  is  to  capture  the  main  aspects  of  interest  of  these  processes.  Tests  can 
help  to  find  a  model  that  is  reasonable,  given  the  information  at  hand. 
Hereby  the  less  relevant  details  are  neglected  on  purpose.  One  should  not 
always  blindly  follow  rules  of  thumb  like  significance  levels  of  5  per  cent  in 
testing.  For  example,  in  large  samples  nearly  every  null  hypothesis  will  be 
rejected  at  the  5  per  cent  significance  level.  In  many  cases  the  relevant 
question  is  not  so  much  whether  the  null  hypothesis  is  exactly  correct,  but 
whether  it  is  a  reasonable  approximation.  This  means,  for  instance,  that 
significance  levels  should  in  practice  be  taken  as  a  decreasing  function  of 
the  sample  size. 

Example  1.11:  Simulated  Normal  Random  Sample  (continued) 

To illustrate the power of tests, we consider a simulation experiment
where the data $y_i$, $i = 1, \cdots, n$, are generated by independent drawings from
N($\mu$, 1). We assume that the modeller knows that the data are generated by
an NID process with known variance $\sigma^2 = 1$ but that the mean $\mu$ is unknown.
We will test the null hypothesis $H_0: \mu = 0$ against the alternative
$H_1: \mu \neq 0$. We will now discuss (i) two alternative test statistics (the mean
and the median), (ii) the choice of critical regions (for fixed significance
level), (iii) the set-up of the simulation experiment, and (iv) the outcomes
of this experiment.

(i)  Two  test  statistics:  mean  and  median 

We will analyse the properties of two alternative estimators of $\mu$, namely,
the sample mean $\bar y$ and the sample median med(y). Both estimators can be
used to construct test statistics to test the null hypothesis.


(ii) Choice of critical regions

As both statistics (sample mean and sample median) have a distribution that
is symmetric around the population mean, intuition suggests that we should
reject the null hypothesis if $|\bar y| > c_1$ (if we use the sample mean) or if
$|\mathrm{med}(y)| > c_2$ (if we use the sample median). We fix the size of the tests at 5
per cent in all cases. The critical values $c_1$ and $c_2$ are determined by the
condition that $P[-c_1 < \bar y < c_1] = P[-c_2 < \mathrm{med}(y) < c_2] = 0.95$ when
$\mu = 0$. We consider sample sizes of n = 10, n = 100, and n = 1000, for
which $c_1 = 1.96/\sqrt{n}$ (because $\sigma$ is known to the modeller) and $c_2$ is approximately
0.73, 0.24, and 0.08 respectively. The values of $c_2$ are obtained from a
simulation experiment with 100,000 runs, where in each run the median of a
sample $y_i \sim N(0, 1)$, $i = 1, \cdots, n$, is determined.

(iii)  Simulation  experiment 

To investigate the power properties of the two test statistics, we consider as
data generating processes $y_i \sim N(\mu, 1)$ for a range of eleven values for $\mu$
between $\mu = -2$ and $\mu = 2$, including $\mu = 0$ (see Exhibit 1.15 for the precise
values of $\mu$ in the eleven experiments). This leads to in total thirty-three
simulation experiments (three sample sizes n = 10, 100, or 1000, for each of
the eleven values of $\mu$). For each experiment we perform 10,000 simulation
runs and determine the frequency of rejection of the null hypothesis, both for
the sample mean and for the sample median.


Population           n = 10            n = 100           n = 1000
mean              Mean   Median     Mean   Median     Mean   Median

$\mu = -2$       100.0    100.0    100.0    100.0    100.0    100.0
$\mu = -1$        89.0     77.2    100.0    100.0    100.0    100.0
$\mu = -0.2$       9.7      8.8     51.2     35.9    100.0     99.9
$\mu = -0.1$       6.4      5.9     17.3     13.2     88.5     70.7
$\mu = -0.05$      5.4      5.5      7.3      6.9     35.5     23.6
$\mu = 0$          5.0      5.0      4.7      5.0      4.8      4.8
$\mu = 0.05$       5.2      5.0      7.7      7.5     34.4     24.1
$\mu = 0.1$        6.4      6.1     16.7     12.5     88.3     71.1
$\mu = 0.2$        9.5      7.8     51.6     37.2    100.0     99.9
$\mu = 1$         88.4     76.7    100.0    100.0    100.0    100.0
$\mu = 2$        100.0    100.0    100.0    100.0    100.0    100.0

Exhibit  1.15  Simulated  Normal  Random  Sample  (Example  1.11) 

Results of simulation experiments with random samples of different sizes and with different
means of the data generating process. The numbers in the table report the rejection percentages
(over 10,000 simulation runs) of the null hypothesis that $\mu = 0$, using tests of size 5% based on
the sample mean and on the sample median.
(iv) Outcomes of the simulation experiment

The results are in Exhibit 1.15. For $\mu = 0$ this indicates that the size is indeed
around 5 per cent. For $\mu \neq 0$ the power of the tests increases for larger
samples, indicating that both tests are consistent. The sample mean has
higher power than the sample median. This means that, for the normal
distribution, the sample mean is to be preferred above the sample median
to perform tests on the population mean. Note that for n = 1000 the null
hypothesis that $\mu = 0$ is rejected nearly always (both by the mean and by the
median) if the data generating process has $\mu = 0.2$. It depends on the particular
investigation whether the distinction between $\mu = 0$ and $\mu = 0.2$ is
really of interest, that is, whether this difference is of practical significance.
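
One cell pair of Exhibit 1.15 can be reproduced approximately with the following sketch, which compares the rejection frequencies of the mean and the median for n = 100 and $\mu = 0.2$. It is a simplified stand-in for the experiment described above: it uses 2,000 runs instead of 10,000, an arbitrary seed, the rounded critical value 0.24 for the median quoted in part (ii), and a helper `rejection_rates` that is our own naming.

```python
import random
import statistics
from math import sqrt

def rejection_rates(mu, n, c_mean, c_median, runs, rng):
    """Fraction of runs in which |mean| > c_mean and in which |median| > c_median."""
    rej_mean = 0
    rej_median = 0
    for _ in range(runs):
        sample = [rng.gauss(mu, 1.0) for _ in range(n)]
        if abs(statistics.fmean(sample)) > c_mean:
            rej_mean += 1
        if abs(statistics.median(sample)) > c_median:
            rej_median += 1
    return rej_mean / runs, rej_median / runs

rng = random.Random(2024)   # arbitrary seed, for reproducibility only
n = 100
c_mean = 1.96 / sqrt(n)     # critical value for the sample mean (sigma = 1 known)
c_median = 0.24             # rounded critical value for the median, taken from part (ii)

power_mean, power_median = rejection_rates(0.2, n, c_mean, c_median,
                                           runs=2000, rng=rng)
print(power_mean, power_median)
```

Up to simulation noise, the two frequencies should be close to the entries 51.6 and 37.2 per cent in the row $\mu = 0.2$ of the exhibit, with the mean rejecting more often than the median.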


1.4.2  Tests  for  mean  and  variance 

First  used  in  Section  2.3.1. 


Two-sided  test  for  the  mean 

To illustrate the general principles discussed in Section 1.4.1, we consider
tests for the mean and variance of a population. It is assumed that the data
$y_i$, $i = 1, \cdots, n$, consist of a random sample from a normal distribution, so
that $y_i \sim NID(\mu, \sigma^2)$, $i = 1, \cdots, n$. Both the mean $\mu$ and the variance $\sigma^2$ of the
population are unknown, so that $\theta = (\mu, \sigma^2)$. First we test a hypothesis about
the mean, for instance,

$$H_0: \mu = \mu_0, \qquad H_1: \mu \neq \mu_0,$$

where $\mu_0$ is a given value. This is called a two-sided test, as the alternative
contains values $\mu > \mu_0$ as well as values $\mu < \mu_0$. As test statistic we consider
the sample mean $\bar y$ and we reject the null hypothesis if $|\bar y - \mu_0| > c$. This
defines the critical region of the test, and $c$ determines the size of the test.
According to (1.33), for $\mu = \mu_0$ we get

$$\frac{\bar y - \mu_0}{\sigma/\sqrt{n}} \sim N(0, 1).$$

However, the expression on the left-hand side is not a statistic, as $\sigma^2$ is
unknown. If this is replaced by the unbiased estimator $s^2$, we obtain,
according to (1.36),

$$t = \frac{\bar y - \mu_0}{s/\sqrt{n}} \sim t(n - 1). \qquad (1.52)$$

Because all the terms in this expression are known ($\mu_0$ is given and $\bar y$ and $s$ can
be computed from the data), this is a test statistic. It is also a similar test
statistic, as the statistic $t$ is pivotal in the sense that the distribution does not
depend on the unknown variance $\sigma^2$. The test (1.52) is called the t-test for the
mean. The null hypothesis is rejected if $|t| > c$, where the value of $c$ is chosen
in accordance with the desired size of the test. So the null hypothesis is
rejected if $\bar y$ falls in the critical region

$$\bar y < \mu_0 - c\frac{s}{\sqrt{n}} \quad \text{or} \quad \bar y > \mu_0 + c\frac{s}{\sqrt{n}}. \qquad (1.53)$$

The size of this test is equal to $P[|t| > c]$, where $t$ follows the $t(n - 1)$
distribution. For size 5 per cent, the critical value for n = 20 is around
2.09; for n = 60 it is around 2.00, and for $n \to \infty$ it converges to 1.96. As
a rule of thumb, one often takes $c \approx 2$. Note that $s/\sqrt{n}$ is the estimated
standard deviation of the sample mean; see (1.33), where $\sigma$ is replaced by $s$.
The estimated standard deviation $s/\sqrt{n}$ is called the standard error of the
sample mean. So the null hypothesis is rejected if the sample mean is more
than two standard errors away from the postulated mean $\mu_0$. In this case one
says that the sample mean differs significantly from $\mu_0$, or (when $\mu_0 = 0$) that
the sample mean is significant at the 5 per cent significance level.
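
The computation of the t-statistic (1.52) can be sketched as follows. The sample of n = 20 observations is made up for illustration, the function name `t_statistic` is our own, and the critical value 2.09 is the 5 per cent two-sided value for n = 20 quoted above.

```python
from math import sqrt
import statistics

def t_statistic(sample, mu0):
    """t-statistic (1.52): (ybar - mu0) / (s / sqrt(n))."""
    n = len(sample)
    ybar = statistics.fmean(sample)
    s = statistics.stdev(sample)  # square root of the unbiased variance estimator s^2
    return (ybar - mu0) / (s / sqrt(n))

# Hypothetical sample of n = 20 observations (made up for illustration).
sample = [0.3, -1.2, 0.8, 1.5, -0.4, 0.9, 2.1, -0.7, 0.2, 1.1,
          0.6, -0.3, 1.8, 0.4, -0.9, 1.3, 0.7, 0.0, 1.0, -0.2]

t = t_statistic(sample, mu0=0.0)
reject = abs(t) > 2.09  # two-sided test of H0: mu = 0 at size 5 per cent, n = 20
print(round(t, 2), reject)
```

For this sample the mean lies a little more than two standard errors away from zero, so the null hypothesis of zero mean is rejected at the 5 per cent level, although not by a wide margin.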

One-sided  test  for  the  mean 

In some cases it is of interest to test the null hypothesis $H_0: \mu = \mu_0$ against the
one-sided alternative $H_1: \mu > \mu_0$. This is called a one-sided test. The test
statistic is again as given in (1.52), but now the null hypothesis is rejected if
$t > c$ with size equal to $P[t > c]$ where $t \sim t(n - 1)$. For size 5 per cent, the critical
value for n = 20 is around 1.73 in this case; for n = 60 it is around 1.67, and for
$n \to \infty$ it approaches 1.645. This test can also be used for the null hypothesis
$H_0: \mu \le \mu_0$ against the alternative $H_1: \mu > \mu_0$. Tests for $H_0: \mu = \mu_0$ or
$H_0: \mu \ge \mu_0$ against $H_1: \mu < \mu_0$ are performed in a similar fashion, where
$H_0$ is rejected for small values of $t$.

Probability  value  (P-value) 

In practice it may not be clear how to choose the size. In principle this depends
on the consequences of making errors of the first and second type, but such
consequences are often difficult to determine. Instead of fixing the size, for instance
at 5 per cent, one can also leave the size unspecified and compute the value
of the test statistic from the observed sample. One can then ask for which sizes
this test outcome would lead to rejection of the null hypothesis. As larger sizes
correspond to larger rejection probabilities, there will be a minimal value of
the size for which the null hypothesis is rejected. This is called the probability
value or P-value of the test outcome. That is, the null hypothesis should be
rejected for all sizes larger than P, and it should not be rejected for all sizes
smaller than P. Stated otherwise, the null hypothesis should be rejected for
small values of P and it should not be rejected for large values of P.

P-value  for  the  mean 

As an example, let $t_0$ be the calculated value of the t-test for the null
hypothesis $H_0: \mu = \mu_0$ against the two-sided alternative $H_1: \mu \neq \mu_0$; then
the P-value is given by $P = P[t < -|t_0| \text{ or } t > |t_0|]$. If this P-value is small, this
means that outcomes of the test statistic so far away from zero are improbable
under the null hypothesis, so that the null hypothesis should be rejected.
Note that the P-value depends on the form of the (one-sided or two-sided)
alternative hypothesis. For instance, when $H_0: \mu = \mu_0$ is tested against
$H_1: \mu > \mu_0$, then $P = P[t > t_0]$. This is illustrated graphically in Exhibit
1.16. In general, the P-value can be defined as the probability (under the
null hypothesis) of getting the observed outcome of the test statistic or a more
extreme outcome, that is, the P-value is the corresponding (one-sided or
two-sided) tail probability.
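
The tail probabilities above are easy to compute once the null distribution of the test statistic is available. The sketch below makes a simplifying assumption: it uses the standard normal distribution in place of the $t(n - 1)$ distribution, which is reasonable only for large n. The function name `p_value` and the labels of the alternatives are our own.

```python
from statistics import NormalDist

def p_value(t0, alternative="two-sided"):
    """Tail probability of the outcome t0 under a N(0, 1) null distribution.

    The N(0, 1) distribution is used as a large-sample stand-in for t(n - 1).
    """
    z = NormalDist()
    if alternative == "greater":    # H1: mu > mu0, P[t > t0]
        return 1.0 - z.cdf(t0)
    if alternative == "less":       # H1: mu < mu0, P[t < t0]
        return z.cdf(t0)
    if alternative == "two-sided":  # H1: mu != mu0, P[t < -|t0| or t > |t0|]
        return 2.0 * (1.0 - z.cdf(abs(t0)))
    raise ValueError("alternative must be 'greater', 'less' or 'two-sided'")

print(p_value(1.96))                # two-sided, close to 0.05
print(p_value(1.645, "greater"))    # one-sided, close to 0.05
```

The same outcome of the test statistic gives a two-sided P-value that is twice the one-sided value in the direction of the outcome, which shows how the P-value depends on the form of the alternative hypothesis.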

Chi-square  test  for  the  variance 

Next we consider tests on the variance. Again it is assumed that the data
consist of a random sample $y_i \sim N(\mu, \sigma^2)$, $i = 1, \cdots, n$, with $\mu$ and $\sigma^2$ unknown.
Let the null hypothesis be $H_0: \sigma^2 = \sigma_0^2$ and the (one-sided) alternative
$H_1: \sigma^2 > \sigma_0^2$. If the null hypothesis holds true, then $(y_i - \mu)/\sigma_0 \sim N(0, 1)$
are independent, so that

$$\sum_{i=1}^{n}\frac{(y_i - \mu)^2}{\sigma_0^2} \sim \chi^2(n).$$

Exhibit 1.16

P-value of a one-sided test ((a), for the one-sided alternative that the parameter is larger
than zero) and of a two-sided test ((b), for the two-sided alternative that the parameter is
not zero, with equal areas in both tails). The arrow indicates the outcome of the test
statistic calculated from the observed sample. In (a) the P-value is equal to the surface of
the shaded area, and in (b) the P-value is equal to the sum of the surfaces of the two
shaded areas.

However, this is not a test statistic, as $\mu$ is unknown. When this parameter is
replaced by its estimate $\bar y$, we obtain, according to (1.35),

$$\frac{1}{\sigma_0^2}\sum_{i=1}^{n}(y_i - \bar y)^2 = \frac{(n - 1)s^2}{\sigma_0^2} \sim \chi^2(n - 1). \qquad (1.54)$$

The null hypothesis is rejected for large values of this test statistic, with
critical value determined from the $\chi^2(n - 1)$ distribution. For other hypotheses,
for instance $H_0: \sigma^2 = \sigma_0^2$ against $H_1: \sigma^2 \neq \sigma_0^2$, the same test statistic
can be used with appropriate modifications of the critical regions.

F-test  for  equality  of  two  variances 

Finally we discuss a test to compare the variances of two populations. Suppose
that the data consist of two independent samples, one of $n_1$ observations
distributed as $NID(\mu_1, \sigma_1^2)$ and the other of $n_2$ observations distributed as
$NID(\mu_2, \sigma_2^2)$. Consider the null hypothesis $H_0: \sigma_1^2 = \sigma_2^2$ of equal variances
against the alternative $H_1: \sigma_1^2 \neq \sigma_2^2$ that the variances are different. Let $s_1^2$ be
the sample variance in the first sample and $s_2^2$ that in the second sample. As the
two samples are assumed to be independent, the same holds true for $s_1^2$ and $s_2^2$.
Further, (1.35) implies that $(n_i - 1)s_i^2/\sigma_i^2 \sim \chi^2(n_i - 1)$ for $i = 1, 2$, so that
$(s_2^2/\sigma_2^2)/(s_1^2/\sigma_1^2) \sim F(n_2 - 1, n_1 - 1)$. When the null hypothesis $\sigma_1^2 = \sigma_2^2$ is
true, it follows that

$$\frac{s_2^2}{s_1^2} \sim F(n_2 - 1, n_1 - 1) \qquad (1.55)$$

and the null hypothesis is rejected if this test statistic differs significantly
from 1. The critical values can be obtained from the $F(n_2 - 1, n_1 - 1)$
distribution. Note that this test statistic is similar, as its distribution does
not depend on the unknown parameters $\mu_1$ and $\mu_2$.

The testing problem $H_0: \mu_1 = \mu_2$ against the alternative $H_1: \mu_1 \neq \mu_2$ is
more complicated and will be discussed later (see Exercise 3.10).

Example  1.12:  Student  Learning  (continued) 

We  illustrate  tests  for  the  mean  and  variance  by  considering  the  random 
variable  consisting  of  the  FGPA  score  of  students  at  the  Vanderbilt  University 
(see  Example  1.1).  We  will  discuss  (i)  a  test  for  the  mean  and  (ii)  a  test  for  the 
equality  of  two  variances. 

(i)  Test  for  the  mean 

Suppose that the mean value of this score over a sequence of previous years is
equal to 2.70 (this is a hypothetical, non-random value). Further suppose
that the individual FGPA scores of students in the current year are independently
and normally distributed with unknown mean $\mu$ and unknown variance
$\sigma^2$. The university wishes to test the null hypothesis $H_0: \mu = 2.70$ of average
scores against the alternative hypothesis $H_1: \mu > 2.70$ that the scores in the
current year are above average.

The FGPA scores in the current year of 609 students of this university are
summarized in Exhibit 1.4 (a). The sample mean and standard deviation
are $\bar y = 2.793$ and $s = 0.460$ (after rounding; see Exhibit 1.4 (a) for the
more precise numbers). So the sample average is above 2.70. The question
is whether this should be attributed to random fluctuations in student
scores or whether the student scores are above average in the current year.
Under the null hypothesis that $\mu = 2.70$, it follows from (1.52) that
$(\bar y - 2.70)/(s/\sqrt{609}) \sim t(608)$. The one-sided critical value for size 5 per
cent is (approximately) 1.645. If we substitute the values of $\bar y$ and $s$ as
calculated from the sample, this gives the value t = 4.97. As this outcome
is well above 1.645, this leads to rejection of the null hypothesis. The P-value
of this test outcome is around $10^{-6}$. It seems that the current students
have better scores on average than the students in previous years.
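
The arithmetic of this test can be checked with a few lines of Python. The sketch uses the rounded values $\bar y = 2.793$ and $s = 0.460$, so the outcome differs slightly from the value 4.97 that is based on the unrounded numbers, and it approximates the t(608) distribution by the standard normal distribution, so the resulting P-value is only indicative.

```python
from math import sqrt
from statistics import NormalDist

n = 609        # number of students
ybar = 2.793   # sample mean (rounded)
s = 0.460      # sample standard deviation (rounded)
mu0 = 2.70     # hypothesized mean under H0

standard_error = s / sqrt(n)
t = (ybar - mu0) / standard_error

reject = t > 1.645                    # one-sided test at size 5 per cent
p_approx = 1.0 - NormalDist().cdf(t)  # N(0, 1) approximation of P[t(608) > t]

print(round(standard_error, 4), round(t, 2), reject, p_approx)
```

With the rounded inputs the statistic comes out just below 5, far above the critical value, and the approximate P-value is very small, in line with the conclusion of the example.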

(ii)  Test  for  equality  of  two  variances 

Next we split the sample into two parts, males and females, and we assume
that all scores are independently distributed with distribution $N(\mu_1, \sigma_1^2)$ for
males and $N(\mu_2, \sigma_2^2)$ for females. The sample means and standard deviations
for both sub-samples are in Exhibit 1.6, that is, $n_1 = 373$ and $n_2 = 236$, the
sample means are $\bar y_1 = 2.728$ and $\bar y_2 = 2.895$, and the sample standard
deviations are $s_1 = 0.441$ and $s_2 = 0.472$. We test whether $\sigma_1^2 = \sigma_2^2$ against
the alternative that $\sigma_1^2 \neq \sigma_2^2$ by means of (1.55). The outcome is
$F = s_2^2/s_1^2 = (0.472)^2/(0.441)^2 = 1.14$, and for the corresponding F(235,
372) distribution this gives a (two-sided) P-value of around 0.26. The null
hypothesis of equal variances is not rejected (at 5 per cent significance level).
That is, there is no significant difference in the variance of the scores for male
and female students.
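
As a rough check of this outcome, the P-value can be approximated by simulation without F tables. The sketch below rests on several assumptions of our own: draws from the F(235, 372) distribution are generated as ratios of scaled chi-square draws (using the fact that a chi-square variable with k degrees of freedom is a gamma variable with shape k/2 and scale 2), the two-sided P-value is taken as twice the smaller tail probability, and the rounded standard deviations are used, so that F comes out slightly above the reported 1.14.

```python
import random

def f_draw(df_num, df_den, rng):
    """One draw from F(df_num, df_den) as a ratio of scaled chi-square draws."""
    chi2_num = rng.gammavariate(df_num / 2.0, 2.0)  # chi-square with df_num degrees of freedom
    chi2_den = rng.gammavariate(df_den / 2.0, 2.0)  # chi-square with df_den degrees of freedom
    return (chi2_num / df_num) / (chi2_den / df_den)

s1, n1 = 0.441, 373   # males (rounded standard deviation)
s2, n2 = 0.472, 236   # females (rounded standard deviation)
f_obs = s2 ** 2 / s1 ** 2

rng = random.Random(7)  # arbitrary seed, for reproducibility only
runs = 20000
draws = [f_draw(n2 - 1, n1 - 1, rng) for _ in range(runs)]

upper_tail = sum(d >= f_obs for d in draws) / runs
lower_tail = sum(d <= f_obs for d in draws) / runs
p_two_sided = 2.0 * min(upper_tail, lower_tail)  # one common two-sided convention

print(round(f_obs, 3), p_two_sided)
```

The simulated P-value should be of the same order as the reported value of around 0.26, well above 5 per cent, so that the conclusion of not rejecting equal variances is unchanged.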

Exercises:  E:  1.12a-c,  e. 


1.4.3  Interval  estimates  and  the  bootstrap 

First  used  in  Section  2.3.1. 


Interval  estimates 

Although a point estimate may suggest a high precision, this neglects the
random nature of estimators. Because estimates depend on data that are
partly random, it is sometimes preferred to give an interval estimate of
the parameter instead of a point estimate. This interval indicates the uncertainty
about the actual value of the parameter. When the interval is constructed
in such a way that it contains the true parameter value with
probability $1 - \alpha$, then it is called a $(1 - \alpha) \times 100$ per cent interval
estimate. One method to construct such an interval is to use a test of size $\alpha$
and to include all parameter values $\theta^*$ for which the null hypothesis
$H_0: \theta = \theta^*$ is not rejected. Indeed, if the true parameter value is $\theta_0$, then for
a test of size $\alpha$ the probability that $H_0: \theta = \theta_0$ is rejected is precisely $\alpha$, so that
the probability that the constructed interval contains $\theta_0$ is $1 - \alpha$. For
example, assuming that the observations are $NID(\mu, \sigma^2)$, a 95 per cent
interval estimate for the mean is given by all values $\mu$ for which

$$\bar y - c\frac{s}{\sqrt{n}} < \mu < \bar y + c\frac{s}{\sqrt{n}}, \qquad (1.56)$$

where $c$ is such that $P[-c < t < c] = 0.95$ when $t$ has the $t(n - 1)$ distribution.
If $\mu_0$ is the true population mean, the set of outcomes in (1.53), which is
complementary to (1.56), has a probability of 5 per cent, so that the interval in (1.56) has a
probability of 95 per cent to contain $\mu_0$. In a similar way, using (1.54), a 95
per cent interval estimate for the variance $\sigma^2$ is given by

$$\frac{(n - 1)s^2}{c_2} < \sigma^2 < \frac{(n - 1)s^2}{c_1},$$

where $c_1 < c_2$ are chosen such that $P[c_1 < t < c_2] = 0.95$ when $t$ has the
$\chi^2(n - 1)$ distribution. For instance, one can take these values so that
$P[t < c_1] = P[t > c_2] = 0.025$.

Approximate  tests  and  approximate  interval  estimators 

Until now the attention has been restricted to data consisting of random
samples from the normal distribution. In this case, tests can be constructed
for the mean and variance that have a known distribution in finite samples. In
other cases, the distribution of the estimator $\hat\theta$ in finite samples is not known.
When the asymptotic distribution is known, this can be used to construct
asymptotic tests and corresponding interval estimates. For instance, if a
maximum likelihood estimator $\hat\theta_n$ is asymptotically normally distributed

$$\sqrt{n}(\hat\theta_n - \theta_0) \xrightarrow{d} N(0, \sigma^2),$$

then in finite samples we can take as an approximation $\hat\theta_n \approx N(\theta_0, \hat\sigma^2/n)$,
where $\hat\sigma^2$ is a consistent estimator of $\sigma^2$. This means that

$$\frac{\hat\theta_n - \theta_0}{\hat\sigma/\sqrt{n}} \approx N(0, 1).$$

This is very similar to (1.52). For instance, let $H_0: \theta = \theta_0$ and $H_1: \theta \neq \theta_0$;
then the null hypothesis is rejected if

$$\hat\theta < \theta_0 - c\frac{\hat\sigma}{\sqrt{n}} \quad \text{or} \quad \hat\theta > \theta_0 + c\frac{\hat\sigma}{\sqrt{n}},$$

where $P[|t| > c]$ with $t \sim N(0, 1)$ is the approximate size of the test. For a 5
per cent size there holds $c \approx 2$. An approximate 95 per cent interval estimate
of $\theta$ is given by all values in the interval

$$\hat\theta - 2\frac{\hat\sigma}{\sqrt{n}} < \theta < \hat\theta + 2\frac{\hat\sigma}{\sqrt{n}}.$$
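
As an illustration, the approximate interval can be applied to the sample mean, using the rounded FGPA summary statistics reported in Example 1.12. The sketch below takes the critical value from the standard normal distribution (about 1.96 rather than the rule-of-thumb value 2); the function name `approx_interval` and the dictionary of inputs are our own.

```python
from math import sqrt
from statistics import NormalDist

def approx_interval(estimate, std_dev, n, level=0.95):
    """Interval estimate +/- c * std_dev / sqrt(n), with c taken from N(0, 1)."""
    c = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = c * std_dev / sqrt(n)
    return estimate - half_width, estimate + half_width

# (sample mean, sample standard deviation, sample size), rounded values from Example 1.12
groups = {
    "all":     (2.793, 0.460, 609),
    "males":   (2.728, 0.441, 373),
    "females": (2.895, 0.472, 236),
}

intervals = {name: approx_interval(*values) for name, values in groups.items()}
for name, (low, high) in intervals.items():
    print(name, round(low, 2), round(high, 2))
```

Because the samples are large, the normal critical value is nearly identical to the corresponding t(n - 1) value, so these approximate intervals practically coincide with the exact intervals of the form (1.56).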


Bootstrap  method 

An  alternative  to  asymptotic  approximations  is  to  use  the  bootstrap  method. 
This  method  has  the  attractive  property  that  it  does  not  require  knowledge  of 
the  shape  of  the  probability  distribution  of  the  data  or  of  the  estimator.  It 
is,  therefore,  said  to  be  distribution  free.  The  bootstrap  method  uses  the 
sample  distribution  to  construct  an  interval  estimate.  We  discuss  this  for 
the case of random samples. If the observations all have different values (that
is, $y_i \neq y_j$ for $i \neq j$, $i, j = 1, \ldots, n$), then the bootstrap probability distribution
of the random variable $y$ is the discrete distribution with outcome set
$V = \{y_1, \ldots, y_n\}$ and with probabilities

$$P[y = y_i] = \frac{1}{n}. \qquad (1.57)$$

The distribution of a statistic $t(y_1, \ldots, y_n)$ is simulated as follows. In
one simulation run, n observations are randomly drawn (with replacement)
from the distribution (1.57) and the corresponding value of t is calculated.
Repeating this in a large number of runs (always with the same distribution
(1.57), which is based on the original data) provides an accurate
approximation of the distribution that t would have if (1.57) were the data generating
process. In reality it will, of course, be only an approximation of the data
generating  process.  However,  the  bootstrap  is  a  simple  method  to  get  an  idea 
of  possible  random  variations  when  there  is  little  information  about  the 
probability  distribution  that  generates  the  data. 
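The simulation procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function names, the number of runs, and the simulated data are hypothetical, and the interval is obtained by discarding equal numbers of the smallest and largest simulated values.

```python
import random
import statistics


def bootstrap_distribution(data, statistic, runs=10_000, seed=None):
    """Simulate the statistic under the bootstrap distribution (probability 1/n per observation)."""
    rng = random.Random(seed)
    n = len(data)
    values = []
    for _ in range(runs):
        # One simulation run: n draws with replacement from the original data.
        resample = rng.choices(data, k=n)
        values.append(statistic(resample))
    return sorted(values)


def percentile_interval(sorted_values, coverage=0.95):
    """Drop the same number of smallest and largest values and return the remaining range."""
    runs = len(sorted_values)
    cut = int(round(runs * (1 - coverage) / 2))
    return sorted_values[cut], sorted_values[runs - cut - 1]


if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical data set of 50 observations from a skewed distribution.
    data = [rng.expovariate(1.0) for _ in range(50)]
    means = bootstrap_distribution(data, statistics.fmean, runs=2000, seed=2)
    print(percentile_interval(means))
```

With 10,000 runs and 95 per cent coverage, the function `percentile_interval` drops the 250 smallest and the 250 largest values, so that the bounds are the 251st and the 9750th ordered values.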
Example 1.13: Student Learning (continued)

We  will  construct  two  interval  estimates  of  the  mean  of  the  FGPA  scores  — 
namely,  (i)  based  on  the  assumed  normal  distribution  of  FGPA  scores,  and  (ii) 
obtained  by  the  bootstrap  method. 

(i)  Interval  based  on  normal  distribution 

We construct interval estimates of the mean of FGPA, both for the combined
population (denoted by $\mu$) and for males ($\mu_1$) and females ($\mu_2$) separately. If
we assume that the scores are in all three cases normally distributed, the
interval estimates can be computed from (1.56). The resulting 95 per cent
interval estimates are

$$2.76 < \mu < 2.83, \quad 2.68 < \mu_1 < 2.77, \quad 2.83 < \mu_2 < 2.96.$$

Note that the interval estimate of $\mu_1$ is below that of $\mu_2$, which suggests that
the two means differ significantly. If we do not assume that the individual
scores are normally distributed but apply the asymptotic normal approximation
for the sample means, then the corresponding asymptotic interval estimates
for the mean are the same as before.

(ii)  Interval  obtained  by  the  bootstrap  method 

Although the sample sizes are relatively large in all three cases, we will
consider the bootstrap as an alternative to construct an interval estimate
for $\mu_2$ (as the corresponding sample size $n_2 = 236$ is the smallest of the three
cases). For this purpose, the bootstrap distribution (1.57) consists of the 236
FGPA scores of female students. We perform 10,000 simulation runs. In each
run, 236 IID observations are drawn from the bootstrap distribution and the
corresponding sample mean is calculated. This gives a set of 10,000 simulated
values of the sample mean, with histogram given in Exhibit 1.17 (a).
Deleting the 250 smallest and the 250 largest values of the sample mean, we
obtain the 95 per cent bootstrap interval estimate $2.83 < \mu_2 < 2.96$ (see
Exhibit 1.17 (b)). This interval coincides (within this precision) with the
earlier interval that was based on the normal distribution, but this is a
coincidence, as in general the two intervals will be different. Here the
outcomes are very close, because the sample size (n = 236) is large enough
to use the normal distribution of the sample mean as a reasonable approximation.

Exhibit 1.17 Student Learning (Example 1.13)

Bootstrap distribution (a) of the sample mean obtained by 10,000 simulation runs (each of
sample size 236) from the bootstrap distribution of the FGPA scores of female students, and
95% bootstrap interval estimate for the population mean $\mu_2$ (b).

(b) Lower value 95% bootstrap interval 2.8338220
    Upper value 95% bootstrap interval 2.9551695

Exercises:  S:  1.14,  1.15c-g;  E:  1.13c,  e,  g. 
SUMMARY

This chapter gives a concise review of the main statistical concepts and
methods that are used in the rest of this book. We discussed methods to
describe observed data by means of graphs and sample statistics. Random
variables provide a means to describe the random nature of economic
data. The normal distribution and distributions related to the normal distribution
play a central role. We considered different methods to estimate the
parameters of distributions from observed data and we discussed statistical
properties such as unbiasedness, efficiency, consistency, and asymptotic distributions
of estimators. Further we discussed hypothesis testing and related
Exercises

THEORY QUESTIONS

1.1 (Section 1.1.2)

Suppose that n pairs of outcomes $(x_i, y_i)$ of the variables x and y have been observed.

a. Prove that the sample correlation coefficient between the variables x and y
always lies between $-1$ and $+1$. It may be helpful to consider the
function $S(b) = \sum_{i=1}^{n} (y_i - \bar{y} - b(x_i - \bar{x}))^2$ and
to use the fact that the minimal value of this
function is non-negative.

b. Prove that the sample correlation coefficient
is invariant under the linear transformation
$x_i^* = a_1 x_i + b_1$ and $y_i^* = a_2 y_i + b_2$ for all
$i = 1, \ldots, n$, with $a_1 > 0$ and $a_2 > 0$ positive
constants.

c. Show, by means of an example, that the sample
correlation coefficient is in general not invariant
under non-linear transformations.

d*. Prove that the sample correlation coefficient is
equal to 1 or $-1$ if and only if y is a linear function
of x, that is, $y_i = a + b x_i$, $i = 1, \ldots, n$, for
some numbers a and b that do not depend on i.

1.2 (Section 1.2.2)

a*. Prove the results in (1.15) and (1.16).

b. Suppose that x and y are independent random
variables. Prove that the conditional distribution
of $y | x = v$ then does not depend on the
value of v and that therefore the conditional
mean and variance of $y | x = v$ also do not
depend on v.

c. Prove that independent variables are uncorrelated.

d. Give an example of two random variables that
are uncorrelated but not independent.

1.3 (Sections 1.2.1, 1.2.2)

a. Prove the result in (1.10) for the case of a linear
transformation $z = ay + b$ with $a \neq 0$.

b. Prove the two results in (1.18).

c. Suppose that $y_1$ and $y_2$ are independent random
variables and that $z_1 = g_1(y_1)$ and $z_2 = g_2(y_2)$.
Use the results in (1.10) and (1.19) to prove that
$z_1$ and $z_2$ are independent.

d. Show, by means of an example, that the result in
c does not generally hold true when 'independent'
is replaced by 'uncorrelated'.

1.4 (Section 1.2.3)

a. Show that the mean and variance of the Bernoulli
distribution are equal to $p$ and $p(1 - p)$
respectively.

b. Show that the mean and variance of the binomial
distribution are equal to $np$ and $np(1 - p)$
respectively.

c. Show the result in (1.23) for the case that A
is an $n \times n$ non-singular matrix, by using the
generalization of the result in (1.19) to the case
of n functions.

d. Show that, when two jointly normally distributed
random variables are uncorrelated, they
are also independent.

e*. Show that the first four moments of the
normal distribution $N(\mu, \sigma^2)$ are equal to
$\mu_1 = \mu$, $\mu_2 = \sigma^2$, $\mu_3 = 0$, and $\mu_4 = 3\sigma^4$. Show
that the skewness and kurtosis are equal to
zero and three respectively.

f. Let $y \sim N(\mu, \sigma^2)$ and let $z = ay + b$ with $a \neq 0$;
then prove that $z \sim N(a\mu + b, a^2\sigma^2)$.

g*. Show the result in (1.22).

1.5 (Sections 1.2.3, 1.2.4)

Let $y \sim N(0, I)$ be an $n \times 1$ vector of independent
standard normal random variables and let $z_0 = Ay$,
$z_1 = y'Q_1 y$ and $z_2 = y'Q_2 y$ with A a given $m \times n$
matrix and with $Q_1$ and $Q_2$ given symmetric idempotent
$n \times n$ matrices.

a. If the rank of $Q_1$ is equal to r, then $Q_1 = UU'$
for an $n \times r$ matrix U with the property that
$U'U = I$ (the $r \times r$ identity matrix). Use this
result to prove that $y'Q_1 y \sim \chi^2(r)$.

b. Show that the mean and variance of the $\chi^2(r)$
distribution are $r$ and $2r$ respectively.

c. Prove the results in (1.29) and (1.30), using the
fact that jointly normally distributed random
variables are independent if and only if they
are uncorrelated.

d. Let $y_r \sim t(r)$; then show that for $r \to \infty$ the
random variables $y_r$ converge in distribution to
the standard normal distribution.

e. Let $y_r \sim F(r_1, r)$ with $r_1$ fixed; then show that for
$r \to \infty$ the random variables $r_1 \cdot y_r$ converge in
distribution to the $\chi^2(r_1)$ distribution.

f. Show that the matrix M in (1.34) is symmetric
and idempotent and that it has rank $(n - 1)$.

1.6 (Sections 1.3.1, 1.3.2)

Let $y_i \sim NID(\mu, \sigma^2)$, $i = 1, \ldots, n$. In Example 1.8
the log-likelihood function (1.40) was analysed in
terms of the parameters $\theta = (\mu, \sigma^2)$, but now we
consider as an alternative the parameters $\psi = (\mu, \sigma)$.

a. Determine gradient and Hessian matrix of the
log-likelihood function $\log(L(\psi))$.

b. Check that the maximum likelihood estimates
are invariant under this change of parameters.
In particular, show that the estimated value of $\sigma$
is the square root of the estimated value of $\sigma^2$ in
Example 1.8.

c. Check the equality in (1.46) for the log-likelihood
function $\log(L(\psi))$.

1.7 (Sections 1.3.2, 1.3.3)

a. Prove the equality in (1.45) for arbitrary distributions.

b. Prove the inequality of Chebyshev, which states
that for a random variable y with mean $\mu$ and
variance $\sigma^2$ there holds $P[|y - \mu| \geq c\sigma] \leq 1/c^2$
for every $c > 0$.

c. Use the inequality of Chebyshev to prove that
the two conditions in (1.48) imply consistency.

d. Use this result to prove that the maximum likelihood
estimators $\hat{\mu}_{ML}$ and $\hat{\sigma}^2_{ML}$ in Example 1.8
are consistent.

e*. Prove the four rules for probability limits that
are stated in the text below formula (1.48), that
is, for the sum, the product, the quotient, and
arbitrary continuous functions of sequences of
random variables.

1.8 (Sections 1.3.2, 1.3.3)

Let $y_1, \ldots, y_n$ be a random sample from a Bernoulli
distribution.

a. Derive the maximum likelihood estimator of the
parameter $p = P[y_i = 1]$. Investigate whether
this estimator is unbiased and consistent.

b. Derive the Cramér-Rao lower bound.

c. Investigate whether the estimator in a is efficient
in the class of unbiased estimators.

d. Suppose that the odds ratio $P[y = 1]/P[y = 0]$ is
estimated by $\hat{p}/(1 - \hat{p})$ with $\hat{p}$ the estimator in a.
Investigate whether this estimator of the odds
ratio is unbiased and consistent.

1.9 (Sections 1.3.1, 1.3.2)

Let $y_i \sim IID(\mu, \sigma^2)$, $i = 1, \ldots, n$, and consider
linear estimators of the mean $\mu$ of the form

$$\hat{\mu} = \sum_{i=1}^{n} a_i y_i.$$

a. Derive the restriction on $a_i$ needed to guarantee
that the estimator $\hat{\mu}$ is unbiased.

b. Derive the expression for the variance of the
estimator $\hat{\mu}$.

c. Derive the linear unbiased estimator that has the
minimal variance in this class of estimators.

d. Suppose that the distribution of each observation
is given by the double exponential density
$f(v) = \frac{1}{2} e^{-|v - \mu|}$ with $-\infty < v < \infty$. Show that in
this case the maximum likelihood estimator of $\mu$
is the median.

e. Discuss whether the estimator of c will be asymptotically
efficient in the class of all unbiased estimators
if the data are generated by the double
exponential density.

1.10 (Sections 1.3.1, 1.3.2, 1.3.3)

Let $y_1, \ldots, y_n$ be a random sample from a population
with density function $f_\theta(v) = e^{\theta - v}$ for $v \geq \theta$ and
$f_\theta(v) = 0$ for $v < \theta$, where $\theta$ is an unknown parameter.

a. Determine the method of moments estimator of
$\theta$, based on the first moment. Determine the
mean and variance of this estimator.

b. Prove that this estimator is consistent.

c. Prove that the maximum likelihood estimator of
$\theta$ is given by the minimum value of $y_1, \ldots, y_n$.
Give explicit proofs that this estimator is biased
but consistent.

d. Discuss which of the two estimators in a and c
you would prefer. Consider in particular the two
extreme cases of a single observation ($n = 1$) and
the asymptotic case (for $n \to \infty$).


EMPIRICAL  AND  SIMULATION  QUESTIONS 

1.11 (Sections 1.1.1, 1.1.2, 1.2.2)

In this exercise we consider data of ten
randomly drawn students (the observation
index i indicates the position of the
students in the file of all 609 students of Example
1.1). The values of FGPA, SATM, and FEM of these
students are as follows.

  i     FGPA    SATM   FEM
  8     3.168   6.2    0
  381   3.482   5.4    0
  186   2.592   5.7    1
  325   2.566   6.0    0
  290   2.692   5.9    1
  138   1.805   5.4    1
  178   2.264   6.2    1
  108   3.225   6.0    1
  71    2.074   5.3    0
  594   3.020   6.2    1

a. Compute for each of the three variables the
sample mean, median, standard deviation, skewness,
and kurtosis.

b. Compute the sample covariances and sample correlation
coefficients between these three variables.

c. Make three histograms and three scatter plots for
these three variables.

d. Relate the outcomes in a and b with the results
in c.

e. Compute the conditional means of FGPA and
SATM for the four male students and also for
the six female students. Check the relation (1.15)
(applied to the 'population' of the ten students
considered here) for the two variables FGPA and
SATM.

1.12 (Sections 1.3.3, 1.4.2)

Consider the data set of ten observations
used in Exercise 1.11. The FGPA scores
are assumed to be independently normally
distributed with mean $\mu$ and variance $\sigma^2$,
and the gender variable FEM is assumed to
be independently Bernoulli distributed with parameter
$p = P[FEM = 1]$. These ten students are actually
drawn from a larger data set consisting of 236
female and 373 male students where the FGPA
scores have mean 2.79 and standard deviation 0.46.

a. On the basis of the ten observations, test the null
hypothesis that the mean is $\mu = 2.79$ against the
alternative that $\mu < 2.79$. Use a statistical package
to compute the corresponding (one-sided)
P-value of this test outcome.

b. Repeat a, but now for the two-sided alternative
that $\mu \neq 2.79$. What is the relation between the
P-values of the one-sided and the two-sided tests?

c. Answer the questions in a and b for testing the
null hypothesis $\sigma = 0.46$ against the one-sided
alternative $\sigma > 0.46$.

d. Let $\hat{p}$ denote the random variable consisting of
the fraction of successes in a random sample of
size n from the Bernoulli distribution. Use the
central limit theorem to argue that the approximate
distribution of $\hat{p}$ is $\hat{p} \approx N(p, \frac{1}{n} p(1 - p))$.

e. In the sample there are six female and four male
students. Test the null hypothesis that
$p = 236/609$ against the alternative that
$p \neq 236/609$, based on the asymptotic approximation
in d.

1.13 (Sections 1.1.1, 1.1.2, 1.2.3, 1.4.3)

In this exercise we consider data of 474
employees (working in the banking
sector) on the variables y (the yearly salary in dollars)
and x (the number of finished years of education).

a. Make histograms of the variables x and y and
make a scatter plot of y against x.

b. Compute mean, median, and standard deviation
of the variables x and y and compute the correlation
between x and y. Check that the distribution
of the salaries y is very skewed and has
excess kurtosis.

c. Compute a 95% interval estimate of the mean of
the variable y, assuming that the salaries are
$NID(\mu, \sigma^2)$.

d. Define the random variable $z = \log(y)$. Make
a histogram of the resulting 474 values of z
and compute the mean, median, standard deviation,
skewness and kurtosis of z. Check that
$\bar{z} \neq \log(\bar{y})$ but that $med(z) = \log(med(y))$.
Explain this last result.

e. Compute a 95% interval estimate of the mean of
the variable z, assuming that the observations on
z are $NID(\mu_*, \sigma_*^2)$.

f. If $z \sim N(\mu_*, \sigma_*^2)$, then $y = e^z$ is said to be
lognormally distributed. Show that the mean of y
is given by $\mu = e^{\mu_* + \frac{1}{2}\sigma_*^2}$.

g. Compute a 95% interval estimate of $\mu$, based on
the results in d, e, and f. Compare this interval
with that obtained in c. Which interval do you
prefer?

1.14 (Section 1.4.3)

In this simulation exercise we consider the quality of
the asymptotic interval estimates discussed in
Section 1.4.3. As data generating process we consider
the $t(3)$ distribution that has mean equal to
zero and variance equal to three. We focus on the
construction of interval estimates of the mean and
on corresponding tests.

a. Generate a sample of $n = 10$ independent drawings
from the $t(3)$ distribution. Let $\bar{y}$ be the
sample mean and s the sample standard deviation.
Compute the interval $\bar{y} \pm 2s/\sqrt{n}$. Reject
the null hypothesis of zero mean if and only if
this interval does not include zero.

b. Repeat the simulation run of a 10,000 times and
compute the number of times that the null hypothesis
of zero mean is rejected.

c. Repeat the simulation experiment of a and b for
sample sizes $n = 100$ and $n = 1000$ instead of
$n = 10$.

d. Give an explanation for the simulated rejection
frequencies of the null hypothesis for sample
sizes $n = 10$, $n = 100$, and $n = 1000$.

1.15 (Sections 1.2.3, 1.2.4, 1.4.3)

In this simulation exercise we consider an example
of the use of the bootstrap in constructing an interval
estimate of the median. If the median is taken as
a measure of location of a distribution f, this can be
estimated by the sample median. For a random
sample of size n, the sample median has a standard
deviation of $(2f(m)\sqrt{n})^{-1}$ where m is the median of
the density f. When the distribution f is unknown,
this expression cannot be used to construct an interval
estimate.

a. Show that for random samples from the normal
distribution the standard deviation of the sample
median is $\sigma\sqrt{\pi}/\sqrt{2n}$ whereas the standard deviation
of the sample mean is $\sigma/\sqrt{n}$. Comment on
these results.

b. Show that for random samples from the Cauchy
distribution (that is, the $t(1)$ distribution) the
standard deviation of the sample mean does not
exist, but that the standard deviation of the
sample median is finite and equal to $\pi/(2\sqrt{n})$.

c. Simulate a data set of $n = 1000$ observations
$y_1, \ldots, y_{1000}$ by independent drawings from the
$t(1)$ distribution.

d. Use the bootstrap method (based on the data
of c) to construct a 95% interval estimate of
the median, as follows. Generate a new set
of 1000 observations by IID drawings from
the bootstrap distribution and compute the
median. Repeat this 10,000 times. The 95%
interval estimate of the median can be obtained
by ordering the 10,000 computed sample
medians. The lower bound is then the 251st
value and the upper bound is the 9750th value
in this ordered sequence of sample medians (this
interval contains 9500 of the 10,000 medians,
that is, 95%).

e. Compute the standard deviation of the median
over the 10,000 simulations in d, and compare
this with the theoretical standard deviation in b.

f. Repeat c 10,000 times. Construct a corresponding
95% interval estimate of the median and
compare this with the result in d. Also compute
the standard deviation of the median over these
10,000 simulations and compare this with the
result in b.

g. Comment on the differences between the
methods in d and f and their usefulness in practice
if we do not know the true data generating
process.
Simple Regression


Econometrics is concerned with relations between economic variables. The
simplest case is a linear relation between two variables. This relation can be
estimated by the method of least squares. We discuss this method and we
describe conditions under which this method performs well. We also describe
tests for the statistical significance of models and their use in making predictions.

2.1 Least squares


2.1.1  Scatter  diagrams 

Uses Sections 1.1.1, 1.1.2; Appendix A.1.


Data 

Data form the basic ingredient for every applied econometric study. Therefore
we start by introducing three data sets, one taken from the financial
world, the second one from the field of labour economics, and the third one
from a marketing experiment. These examples are helpful to understand the
methods that we will discuss in this chapter.


Example 2.1: Stock Market Returns

Exhibit 2.1 shows two histograms (in (a) and (b)) and a scatter plot (in (c)) of
monthly excess returns in the UK over the period January 1980 to December
1999. The data are taken from the data bank DataStream. One variable,
which we denote by $y_i$, corresponds to the excess returns on an index of
stocks in the sector of cyclical consumer goods. This index is composed on
the basis of 104 firms in the areas of household durables, automobiles,
textiles, and sports. The consumption of these goods is relatively sensitive
to economic fluctuations, for which reason they are said to be cyclical. The
other variable, which we denote by $x_i$, corresponds to the excess returns on
an overall stock market index. The index i denotes the observation number
and runs from $i = 1$ (for 1980.01) to $i = 240$ (for 1999.12). The excess
returns are obtained by subtracting the return on a riskless asset from the
asset returns. Here we used the one-month interest rate as riskless asset. In
Exhibit 2.1, $x_i$ is denoted by RENDMARK and $y_i$ by RENDCYCO (see
Appendix B for an explanation of the data sets and corresponding notation
of variables used in the book).

The histograms indicate that the excess returns in the sector of cyclical
consumer goods are on average lower than those in the total market and
that they show a relatively larger variation over time. The extremely large
negative returns (of around $-36$ per cent and $-28$ per cent) correspond
to the stock market crash in October 1987. The scatter diagram shows
that the two variables are positively related, since the top right and
bottom left of the diagram contain relatively many observations. However,
the relationship is not completely linear. If we were to draw a straight
line through the scatter of points, then there would be clear deviations
from the line.

Exhibit 2.1 Stock Market Returns (Example 2.1)

Histograms of monthly returns in the sector of cyclical consumer goods (a) and monthly total
market returns (b) in the UK, and corresponding scatter diagram (c).

The  further  analysis  of  these  data  is  left  as  an  exercise  (see  Exercises  2.11, 
2.12,  and  2.15). 


Example  2.2:  Bank  Wages 

Exhibit 2.2 shows two histograms (in (a) and (b)) and a scatter diagram (in
(c)) of 474 observations on education (in terms of finished years of education)
and salary (in (natural) logarithms of the yearly salary S in dollars). The
salaries are measured in logarithms for reasons to be discussed later (see
Example 2.6). The data are taken from one of the standard data files of the
statistical software package SPSS and concern the employees of a US bank.
Each point in the scatter diagram corresponds to the education and salary of
an employee. On average, salaries are higher for higher educated people.
However, for a fixed level of education there remains much variation in
salaries. This can be seen in the scatter diagram (c), as for a fixed value of
education (on the horizontal axis) there remains variation in salaries (on the
vertical axis).

Exhibit 2.2 Bank Wages (Example 2.2)

Histograms of education (in years (a)) and salary (in logarithms (b)) of 474 bank employees,
and corresponding scatter diagram (c).

In  the  sequel  we  will  take  this  as  the  leading  example  to  illustrate  the 
theory  of  this  chapter. 


Example  2.3:  Coffee  Sales 

Exhibit 2.3 shows a scatter diagram with twelve observations $(x_i, y_i)$ on price
and  quantity  sold  of  a  brand  of  coffee.  The  data  are  taken  from  A.  C. 
Bemmaor  and  D.  Mouchoux,  ‘Measuring  the  Short-Term  Effect  of  In-Store 
Promotion  and  Retail  Advertising  on  Brand  Sales:  A  Factorial  Experiment’, 
Journal  of  Marketing  Research,  28  (1991),  202-14,  and  were  obtained  from 
a  controlled  marketing  experiment  in  stores  in  Paris.  The  price  is  indexed, 
with  value  one  for  the  usual  price.  Two  price  actions  are  investigated,  with 
reductions  of  5  per  cent  or  15  per  cent  of  the  usual  price.  The  quantity  sold  is 
in  units  of  coffee  per  week.  Clearly,  lower  prices  result  in  higher  sales. 
Further,  for  a  fixed  price  (on  the  horizontal  axis)  there  remains  variation  in 
sales  (different  values  on  the  vertical  axis). 

The  further  analysis  of  these  data  is  left  as  an  exercise  (see  Exercise  2.10). 
Exhibit 2.3 Coffee Sales (Example 2.3)


Scatter  diagram  of  quantity  sold  against  price  index  of  a  brand  of  coffee  (the  data  set  consists  of 
twelve  observations;  in  the  top  left  part  of  the  diagram  two  observations  nearly  coincide). 


2.1.2  Least  squares 

Uses Appendix A.1, A.7.

Fitting  a  line  to  a  scatter  of  data 

Our starting point is a set of points in a scatter diagram corresponding to n
paired observations $(x_i, y_i)$, $i = 1, \ldots, n$, and we want to find the line that
gives the best fit to these points. We describe the line by the formula


y  =  a  +  bx.  (2.1) 

Here  b  is  called  the  slope  of  the  line  and  a  the  intercept.  The  idea  is  to  explain 
the  differences  in  the  outcomes  of  the  variable  y  in  terms  of  differences  in  the 
corresponding  values  of  the  variable  x.  To  evaluate  the  fit,  we  assume  that 
our  purpose  is  to  explain  or  predict  the  value  of  y  that  is  associated  with  a 
given  value  of  x.  In  the  three  examples  in  Section  2.1.1,  this  means  that  the 
monthly  variation  in  sector  returns  is  explained  by  the  market  returns,  that 
differences  in  salaries  are  explained  by  education,  and  that  variations  in  sales 
are  explained  by  prices. 

Terminology 

The variable y in (2.1) is called the variable to be explained (or also
the dependent variable or the endogenous variable) and the variable x is
called the explanatory variable (or also the independent variable, the exogenous
variable, the regressor, or the covariate). We measure the deviations $e_i$ of
the observations from the line vertically, that is,

$$e_i = y_i - a - b x_i. \qquad (2.2)$$

So $e_i$ is the error that we make in predicting $y_i$ by means of the variable $x_i$,
using the linear relation (2.1) (see Exhibit 2.4).

Exhibit 2.4 Scatter diagram with fitted line

Scatter diagram with observed data $(x_i, y_i)$, regression line $(a + bx)$, fitted value $(a + bx_i)$, and
residual $(e_i)$.

The  least  squares  criterion 

Now we have to make precise what we mean by the fit of a line. We will do
this by specifying a criterion function, that is, a function of a and b that
takes smaller values if the deviations are smaller. In many situations we
dislike positive deviations as much as negative deviations, in which case the
criterion function depends on a and b via the absolute values of the deviations
$e_i$. There are several ways to specify such a function, for instance,

$$S_{abs}(a, b) = \sum |e_i|,$$

$$S(a, b) = \sum e_i^2.$$

In both cases the summation index runs from 1 through n (where no
misunderstanding can arise we simply write $\sum$). The second of
these functions, the least squares criterion that measures the sum of squared
deviations, is by far the most frequently used. The reason is that its minimization
is much more convenient than that of other functions. This method is
also called ordinary least squares (abbreviated as OLS). However, we will
meet other criterion functions in later chapters. In this chapter we restrict our
attention to the minimization of $S(a, b)$.
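To make the two criterion functions concrete, the following Python sketch evaluates the sum of absolute deviations and the sum of squared deviations for a few candidate lines. The function names, the candidate values of a and b, and the four data points are hypothetical and serve only as an illustration.

```python
def residuals(a, b, xs, ys):
    """Vertical deviations e_i = y_i - a - b * x_i of the points from the line."""
    return [y - a - b * x for x, y in zip(xs, ys)]


def s_abs(a, b, xs, ys):
    """Criterion based on the sum of absolute deviations."""
    return sum(abs(e) for e in residuals(a, b, xs, ys))


def s_squares(a, b, xs, ys):
    """Least squares criterion: the sum of squared deviations."""
    return sum(e * e for e in residuals(a, b, xs, ys))


if __name__ == "__main__":
    # Hypothetical data, not taken from the examples in the text.
    xs = [1.0, 2.0, 3.0, 4.0]
    ys = [1.1, 1.9, 3.2, 3.8]
    for a, b in [(0.0, 1.0), (0.5, 0.8), (1.0, 0.5)]:
        print(a, b, s_abs(a, b, xs, ys), s_squares(a, b, xs, ys))
```

Both criteria are zero only if all points lie on the line, but they can rank candidate lines differently because squaring gives more weight to large deviations.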

Computation  of  the  least  squares  estimates  a  and  b 

By  substituting  (2.2)  in  the  least  squares  criterion  we  obtain 


$$S(a, b) = \sum (y_i - a - b x_i)^2. \tag{2.3}$$

Here the n observations $(x_i, y_i)$ are given, and we minimize the function
$S(a, b)$ with respect to a and b. The first order conditions for a minimum are
given by


$$\frac{\partial S}{\partial a} = -2 \sum (y_i - a - b x_i) = 0, \tag{2.4}$$

$$\frac{\partial S}{\partial b} = -2 \sum x_i (y_i - a - b x_i) = 0. \tag{2.5}$$

From the condition in (2.4) we obtain, after dividing by $2n$, that

$$a = \bar{y} - b\bar{x} \tag{2.6}$$

where $\bar{y} = \sum y_i / n$ and $\bar{x} = \sum x_i / n$ denote the sample means of the variables
y and x respectively. Because $\bar{x}$ is fixed (that is, independent of i) so that
$\sum \bar{x}(y_i - a - b x_i) = \bar{x} \sum (y_i - a - b x_i) = 0$ according to (2.4), we can
rewrite (2.5) as

$$\sum (x_i - \bar{x})(y_i - a - b x_i) = 0. \tag{2.7}$$

Now  we  substitute  (2.6)  in  (2.7)  and  solve  this  expression  for  b,  so  that 


$$b = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}. \tag{2.8}$$
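The substitution step that leads to (2.8) can be written out as follows. This is only a sketch of the intermediate algebra, using the symbols of (2.6) and (2.7).

```latex
% Substitute a = \bar{y} - b\bar{x} from (2.6) into the deviation:
y_i - a - b x_i = (y_i - \bar{y}) - b (x_i - \bar{x}).
% Insert this in (2.7):
\sum (x_i - \bar{x}) \left[ (y_i - \bar{y}) - b (x_i - \bar{x}) \right] = 0
\quad \Longrightarrow \quad
\sum (x_i - \bar{x})(y_i - \bar{y}) = b \sum (x_i - \bar{x})^2 .
% Dividing by \sum (x_i - \bar{x})^2 (assumed positive) gives (2.8).
```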


To  check  whether  the  values  of  a  and  b  in  (2.6)  and  (2.8)  indeed  provide  the 
minimum  of  S(a,  b),  it  suffices  to  check  whether  the  Hessian  matrix  is 
positive  definite.  From  (2.4)  and  (2.5)  the  Hessian  matrix  is  obtained  as 

$$\begin{pmatrix} \partial^2 S/\partial a^2 & \partial^2 S/\partial a\,\partial b \\ \partial^2 S/\partial b\,\partial a & \partial^2 S/\partial b^2 \end{pmatrix} = \begin{pmatrix} 2n & 2\sum x_i \\ 2\sum x_i & 2\sum x_i^2 \end{pmatrix}.$$

This matrix is positive definite if $n > 0$ and the determinant
$4n \sum x_i^2 - 4\left(\sum x_i\right)^2 > 0$, that is,

$$\sum (x_i - \bar{x})^2 > 0.$$

The condition that $n > 0$ is evident, and the condition that $\sum (x_i - \bar{x})^2 > 0$
means that there should be some variation in the explanatory variable x. If
this condition does not hold true, then all the points in the scatter diagram
are situated on a vertical line. Of course, it makes little sense to try to explain
variations in y by variations in x if x does not vary in the sample.

Normal  equations 

We  can  rewrite  the  two  first  order  conditions  in  (2.4)  and  (2.5)  as 

$$a n + b \sum x_i = \sum y_i, \tag{2.9}$$

$$a \sum x_i + b \sum x_i^2 = \sum x_i y_i. \tag{2.10}$$

These  are  called  the  normal  equations.  The  expressions  in  (2.6)  and  (2.8) 
show  that  the  least  squares  estimates  depend  solely  on  the  first  and  second 
(non-centred)  sample  moments  of  the  data. 
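As a minimal sketch, the estimates in (2.6) and (2.8) can be computed directly and then checked against the normal equations (2.9) and (2.10). The data below are hypothetical, and the helper name `ols` is our own choice.

```python
def ols(x, y):
    """Least squares intercept a and slope b, following (2.6) and (2.8)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    if s_xx == 0:
        # No variation in x: the minimum is not unique.
        raise ValueError("the explanatory variable must vary in the sample")
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b


# Hypothetical toy data (not from the text).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)
a, b = ols(x, y)

# Left- and right-hand sides of the normal equations (2.9) and (2.10).
lhs_29, rhs_29 = a * n + b * sum(x), sum(y)
lhs_210 = a * sum(x) + b * sum(xi * xi for xi in x)
rhs_210 = sum(xi * yi for xi, yi in zip(x, y))

print(f"a = {a:.4f}, b = {b:.4f}")
print(f"(2.9):  {lhs_29:.4f} vs {rhs_29:.4f}")
print(f"(2.10): {lhs_210:.4f} vs {rhs_210:.4f}")
```

Both sides of each normal equation agree up to floating point rounding, as they should for any data set in which x varies.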

Remark  on  notation 

Now  that  we  have  completed  our  minimization  procedure,  we  make  a 
remark on the notation. When minimizing (2.3) we have treated the $x_i$ and
$y_i$ as fixed numbers and a and b as independent variables that could be chosen
freely.  After  completing  the  minimization  procedure,  we  have  found  specific 
values  of  a  and  b  by  (2.6)  and  (2.8).  Strict  mathematicians  would  stress  the 
difference  by  using  different  symbols.  From  now  on,  we  no  longer  need  a  and 
b  as  independent  variables  and  for  convenience  we  will  use  the  notation  a 
and  b  only  for  the  expressions  in  (2.6)  and  (2.8).  That  is,  from  now  on  a  and  b 
are  uniquely  defined  as  the  numbers  that  can  be  computed  from  the  observed 
data $(x_i, y_i)$, $i = 1, \ldots, n$, by means of these two formulas.

Exercises: T: 2.1a, b; E: 2.10b, 2.12a, c.


2.1.3  Residuals  and  R2 

Least  squares  residuals 

Given the observations $(x_1, y_1), \ldots, (x_n, y_n)$, and the corresponding unique
values of a and b given by (2.6) and (2.8), we obtain the residuals

$$e_i = y_i - a - b x_i.$$

Because a and b satisfy the first order conditions (2.4) and (2.7), we find two
properties of these residuals

$$\sum e_i = 0, \qquad \sum (x_i - \bar{x}) e_i = 0. \tag{2.11}$$

So (in the language of descriptive statistics) the residuals have zero mean and
they are uncorrelated with the explanatory variable.
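The two properties in (2.11) are easy to check numerically. The sketch below uses hypothetical data and computes the residuals from the least squares values of a and b.

```python
# Hypothetical toy data (not from the text).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares estimates from (2.8) and (2.6).
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
a = y_bar - b * x_bar

# Residuals e_i = y_i - a - b*x_i.
residuals = [yi - a - b * xi for xi, yi in zip(x, y)]

# The two sums in (2.11); both should be zero up to rounding.
sum_e = sum(residuals)
sum_xe = sum((xi - x_bar) * ei for xi, ei in zip(x, residuals))

print(f"sum of residuals:            {sum_e:.2e}")
print(f"sum of (x - x_bar)*residual: {sum_xe:.2e}")
```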

Three  sums  of  squares 

A traditional way to measure the performance of least squares is to compare
the sum of squared residuals with the sum of squares of $(y_i - \bar{y})$. We can
rewrite (2.2) as

$$y_i = a + b x_i + e_i$$

and we obtain from (2.6) that

$$y_i - \bar{y} = b(x_i - \bar{x}) + e_i.$$

So the difference from the mean $(y_i - \bar{y})$ can be decomposed as a sum of two
components, a component corresponding to the difference from the mean of
the explanatory variable $(x_i - \bar{x})$ and an unexplained component described
by the residual $e_i$. The sum of squares of $(y_i - \bar{y})$ also consists of two
components

$$\sum (y_i - \bar{y})^2 = b^2 \sum (x_i - \bar{x})^2 + \sum e_i^2, \tag{2.12}$$

$$SST = SSE + SSR. \tag{2.13}$$

Note that the cross product term $\sum (x_i - \bar{x}) e_i$ vanishes as a consequence of
(2.11). Here SST is called the total sum of squares, SSE the explained sum of
squares, and SSR the sum of squared residuals.

Coefficient  of  determination:  R2 

The  above  three  sums  of  squares  depend  on  the  scale  of  measurement  of  the 
variable  y.  To  get  a  performance  measure  that  is  independent  of  scale  we 
divide  through  by  SST.  The  coefficient  of  determination,  denoted  by  the 
symbol $R^2$, is defined as the relative explained sum of squares

$$R^2 = \frac{SSE}{SST} = \frac{b^2 \sum (x_i - \bar{x})^2}{\sum (y_i - \bar{y})^2}. \tag{2.14}$$

By substituting (2.8) in (2.14) we obtain

$$R^2 = \frac{\left(\sum (x_i - \bar{x})(y_i - \bar{y})\right)^2}{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}.$$

So $R^2$ is equal to the square of the correlation coefficient between x and y. By
using (2.12) it follows that (2.14) can be rewritten as

$$R^2 = 1 - \frac{\sum e_i^2}{\sum (y_i - \bar{y})^2}. \tag{2.15}$$

The expressions (2.14) and (2.15) show that $0 \leq R^2 \leq 1$ and that the least
squares criterion is equivalent to the maximization of $R^2$.
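The decomposition of the total sum of squares and the equivalent expressions for the coefficient of determination can be verified with a short computation. The sketch below uses hypothetical data and our own variable names.

```python
# Hypothetical toy data (not from the text).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b = s_xy / s_xx          # slope, (2.8)
a = y_bar - b * x_bar    # intercept, (2.6)

sst = sum((yi - y_bar) ** 2 for yi in y)                     # total sum of squares
sse = b ** 2 * s_xx                                          # explained sum of squares
ssr = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))    # sum of squared residuals

r2_explained = sse / sst               # definition (2.14)
r2_residual = 1.0 - ssr / sst          # expression (2.15)
r2_correlation = s_xy ** 2 / (s_xx * sst)  # squared correlation between x and y

print(f"SST = {sst:.4f}, SSE = {sse:.4f}, SSR = {ssr:.4f}")
print(f"R2: {r2_explained:.4f}, {r2_residual:.4f}, {r2_correlation:.4f}")
```

For a model with an intercept the three values of R2 coincide, and SSE and SSR add up to SST.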


R2  in  model  without  intercept 

Until  now  we  have  assumed  that  an  intercept  term  (the  coefficient  a)  is 
included in the model. If the model does not contain an intercept (that is,
if the fitted line is of the form $y = bx$), then $R^2$ is still defined as in (2.14).
However,  the  results  in  (2.11),  (2.12),  and  (2.15)  no  longer  hold  true  (see 
Exercise  2.4). 
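A small numerical sketch shows what can go wrong without an intercept. The slope formula used below, the ratio of the sum of x times y to the sum of squared x, is not given in the text; it follows from minimizing the sum of squared deviations of y from bx with respect to b alone, and is stated here as an assumption of the example. The data are hypothetical.

```python
# Hypothetical toy data (not from the text).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Slope of the line y = b*x through the origin (first order condition in b only).
b0 = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
residuals0 = [yi - b0 * xi for xi, yi in zip(x, y)]

sum_e0 = sum(residuals0)  # no longer zero in general
sst = sum((yi - y_bar) ** 2 for yi in y)
ssr0 = sum(e * e for e in residuals0)
s_xx = sum((xi - x_bar) ** 2 for xi in x)

r2_214 = b0 ** 2 * s_xx / sst  # definition (2.14)
r2_215 = 1.0 - ssr0 / sst      # expression (2.15)

print(f"b = {b0:.4f}, sum of residuals = {sum_e0:.4f}")
print(f"(2.14) gives {r2_214:.4f}, (2.15) gives {r2_215:.4f}")
```

For these data the residuals do not sum to zero, the two expressions disagree, and neither lies between 0 and 1.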

Exercises:  T:  2.2;  E:  2.10c,  2.11. 


2.1.4  Illustration:  Bank  Wages 

We  illustrate  the  results  in  Sections  2.1.2  and  2.1.3  with  the  data  on  bank 
wages  discussed  before  in  Example  2.2.  We  will  discuss  (i)  the  precision  of 
reported  results  in  this  book,  (ii)  the  least  squares  estimates,  (iii)  the  sums  of 
squares  and  R2,  and  (iv)  the  outcome  of  a  regression  package  (we  used  the 
package  EViews). 

(i)  Precision  of  reported  results 

For  readers  who  want  to  check  the  numerical  outcomes,  we  first  comment  on 
the precision of the reported results. In all our examples, we report
intermediary and final results with a much lower precision than the software
packages  used  to  compute  the  outcomes.  Therefore,  to  check  the  outcomes, 
the  reader  should  also  use  a  software  package  and  should  not  work  with  our 
intermediary  outcomes,  which  involve  rounding  errors. 

(ii)  Least  squares  estimates 

Continuing  the  discussion  in  Example  2.2  on  bank  wages,  we  report  the 
following sample statistics for the $n = 474$ observations of the variables x
(education) and y (natural logarithm of salary).

$$\sum x_i = 6395, \quad \sum y_i = 4909, \quad \sum x_i^2 = 90215,$$

$$\sum y_i^2 = 50917, \quad \sum x_i y_i = 66609.$$

To compute the slope (2.8), we use

$$\sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - n\bar{x}\bar{y} = \sum x_i y_i - \frac{1}{n} \sum x_i \sum y_i = 378.9,$$

$$\sum (x_i - \bar{x})^2 = \sum x_i^2 - n\bar{x}^2 = \sum x_i^2 - \frac{1}{n} \left(\sum x_i\right)^2 = 3936.5,$$

so  that  b  =  0.096.  The  formula  (2.6)  for  the  intercept  gives  a  =  9.06.  We 
leave  it  as  an  exercise  to  check  that  these  values  satisfy  the  normal  equations 
(2.9)  and  (2.10)  (up  to  rounding  errors).  The  regression  line  is  given  by 
a  +  bx  =  9.06  +  0.096x  and  is  shown  in  Exhibit  2.5  (a).  In  the  sense  of 
least  squares,  this  line  gives  an  optimal  fit  to  the  cloud  of  points.  The 
histogram  of  the  residuals  is  shown  in  Exhibit  2.5  (b). 
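The slope and intercept can be reproduced from the rounded sample statistics reported above. This is only a sketch: because the reported sums are rounded, the results agree with the package output only up to rounding errors, in line with the remark on precision. The variable names are ours.

```python
# Rounded sample statistics as reported in the text (n = 474 observations).
n = 474
sum_x = 6395.0    # sum of x_i (education)
sum_y = 4909.0    # sum of y_i (log salary)
sum_xx = 90215.0  # sum of x_i squared
sum_xy = 66609.0  # sum of x_i * y_i

# Centred cross product and centred sum of squares of x.
s_xy = sum_xy - sum_x * sum_y / n
s_xx = sum_xx - sum_x ** 2 / n

b = s_xy / s_xx                  # slope, (2.8)
a = sum_y / n - b * sum_x / n    # intercept, (2.6)

print(f"s_xy = {s_xy:.1f}, s_xx = {s_xx:.1f}")
print(f"b = {b:.3f}, a = {a:.2f}")
```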

(iii)  Sums  of  squares  and  R2 

The  sums  of  squares  are 

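$$SST = \sum y_i^2 - n\bar{y}^2 = \sum y_i^2 - \frac{1}{n} \left(\sum y_i\right)^2 = 74.7,$$

$$SSE = b^2 \left(\sum x_i^2 - n\bar{x}^2\right) = b^2 \left(\sum x_i^2 - \frac{1}{n} \left(\sum x_i\right)^2\right) = 36.3,$$

$$SSR = SST - SSE = 38.4,$$

with a corresponding coefficient of determination $R^2 = 0.49$.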


Exhibit 2.5 Bank Wages (Section 2.1.4)

Scatter diagram of salary (in logarithms) against education with least squares line (a) and
histogram of the residuals (b). Panel (a) plots LOGSALARY against EDUC.




(iv)  Outcome  of  regression  package 

The  outcome  of  a  regression  package  is  given  in  Exhibit  2.6.  The  two  values 
in  the  column  denoted  by  ‘coefficient’  show  the  values  of  a  (the  constant 
term) and b (the coefficient of the explanatory variable). The table reports
$R^2$, SSR, and also $\bar{y}$ and $\sqrt{SST/(n-1)}$ (the sample mean and sample
standard deviation of the dependent variable). We conclude that there is an
indication of a positive effect of education on salary and that around 50
per cent of the variation in (logarithmic) salaries can be attributed to
differences in education.


Dependent Variable: LOGSALARY
Method: Least Squares
Sample: 1 474

Variable             Coefficient
C                    9.062102
EDUC                 0.095963

R-squared            0.485447    Mean dependent var    10.35679
Sum squared resid    38.42407    S.D. dependent var    0.397334

Exhibit  2.6  Bank  Wages  (Section  2.1.4) 

Results of regression of salary (in logarithms) on a constant (denoted by C) and education,
based on data of 474 bank employees.


2.2 Accuracy of least squares


2.2.1  Data  generating  processes 

Uses Appendix A.1.

The  helpful  fiction  of  a  ‘true  model’  in  statistical  analysis 

Before  discussing  the  statistical  properties  of  least  squares,  we  pay  attention 
to  the  meaning  of  some  of  the  terminology  that  is  used  in  this  context.  This 
concerns  in  particular  the  meaning  of  ‘data  generating  process’  and  ‘true 
model’. 

Economic data are the outcome of economic processes. For instance, the
stock market data in Example 2.1 result from developments in the production
and value of many firms (in this case firms in the sector of cyclical
consumer goods) in the UK, and the sales data in Example 2.3 result from
the purchase decisions of many individual buyers in a number of stores in
Paris. The reported figures may further depend on the method of measurement.
For the stock market data, this depends on the firms that are included
in the analysis, and for the sales data this depends on the chosen shops and
the periods of measurement. It is common to label the combined economic
and measurement process as the data generating process (DGP). An econometric
model aims to provide a concise and reasonably accurate reflection of
the data generating process. By disregarding less relevant aspects of the data,
the model helps to obtain a better understanding of the main aspects of the
data generating process. This implies that, in practice, an econometric model
will never provide a completely accurate description of the data generating
process. Therefore, if taken literally, the concept of a ‘true model’ does not
make much practical sense. Still, in discussing statistical properties, we
sometimes use the notion of a ‘true model’. This reflects an idealized situation
that allows us to obtain mathematically exact results. The idea is that similar
results hold approximately true if the model is a reasonably accurate
approximation of the data generating process.

Simulation  as  a  tool  in  statistical  analysis 

The ideal situation of a ‘true model’ will never hold in practice, but we can
imitate this situation by means of computer simulations. In this case the data
are generated by means of a computer program that satisfies the assumptions
of the model. Then the model is indeed ‘true’, as the data generating process
satisfies all the model assumptions. So, for illustrative purposes, we will start
not by analysing a set of empirical data, but by generating a set of data
ourselves. For that purpose we shall write a small computer program in
which we shall carry out a number of steps.

Example  of  a  simulation  experiment 

We start by choosing a value for the number n of observations, for instance
$n = 20$. Then we fix n numbers for the explanatory variable x, for instance
$x_1 = 1, x_2 = 2, \ldots, x_n = n$. We choose a constant term $\alpha$, say $\alpha = 10$, and
a slope coefficient $\beta$, say $\beta = 1$. Finally we choose a value $\sigma^2$ for the variance
of the disturbance or error term, for instance $\sigma^2 = 25$. Then we generate
n random disturbances $\varepsilon_1, \ldots, \varepsilon_n$. For this purpose we use a generator of
normally distributed random numbers. Many computer packages contain
such a generator. As the computer usually generates random numbers with
zero mean and unit variance, we have to multiply them by $\sigma = 5$ to obtain
disturbances with variance $\sigma^2 = 25$. Finally we generate values for the
dependent variable according to

$$y_i = \alpha + \beta x_i + \varepsilon_i \qquad (i = 1, \ldots, n). \tag{2.16}$$

The role of the disturbances is to ensure that our data points are around the
line $\alpha + \beta x$ instead of exactly on this line. In practice, simple relations like
$y_i = \alpha + \beta x_i$ will not hold exactly true for the observed data, and the
disturbances $\varepsilon_i$ summarize the effect of all the other variables (apart from $x_i$) on $y_i$.
This completes our data generating process (DGP).
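The steps of this data generating process can be sketched in a few lines of code. The sketch below uses the Python standard library generator of normal random numbers; the seed value is an arbitrary choice of ours that only makes the run reproducible.

```python
import random

rng = random.Random(12345)  # arbitrary seed, chosen only for reproducibility

n = 20
alpha, beta, sigma = 10.0, 1.0, 5.0  # sigma**2 = 25 is the disturbance variance
x = [float(i) for i in range(1, n + 1)]  # fixed regressor values 1, 2, ..., n

# Standard normal draws (mean 0, variance 1), multiplied by sigma.
eps = [sigma * rng.gauss(0.0, 1.0) for _ in range(n)]

# Dependent variable according to (2.16).
y = [alpha + beta * xi + ei for xi, ei in zip(x, eps)]

for xi, ei, yi in zip(x[:5], eps[:5], y[:5]):
    print(f"x = {xi:4.1f}  eps = {ei:8.4f}  y = {yi:8.4f}")
```

Changing the seed gives a different series of disturbances and hence a different data set from the same DGP.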

Use  of  simulated  data  in  statistical  analysis 

Now consider the situation of an econometrician whose only information
consists of a data set $(x_i, y_i)$, $i = 1, \ldots, n$, which is generated by this process,
and who does not know the underlying values of $\alpha$, $\beta$, $\sigma$,
and $\varepsilon_i$, $i = 1, \ldots, n$. The observed data are partly random because of the
effects of the disturbance terms $\varepsilon_i$ in (2.16). If this econometrician now
applies the formulas of Section 2.1.2 to this data set to compute a and b,
we can interpret them as estimates of $\alpha$ and $\beta$, respectively, and we can
compare them with the original values of $\alpha$ and $\beta$, which are known to us.
Because of the disturbance terms, the outcomes of a and b are random and in
general $a \neq \alpha$ and $b \neq \beta$. The estimates are accurate if they do not differ
much from the values of $\alpha$ and $\beta$ of the DGP. So this experiment is useful for
assessing the accuracy of the method of least squares.

We can repeat this simulation, say m times. The values of a and b obtained
in the j-th simulation run are denoted by $a_j$ and $b_j$, $j = 1, \ldots, m$. The accuracy
of least squares estimates can be evaluated in terms of the means $\bar{a} = \sum a_j / m$
and $\bar{b} = \sum b_j / m$ and the mean squared errors

$$MSE_b = \frac{1}{m} \sum (b_j - \beta)^2, \qquad MSE_a = \frac{1}{m} \sum (a_j - \alpha)^2.$$
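A repeated simulation of this kind can be sketched as follows. The number of runs m, the seed, and the helper name `ols` are our own choices for the illustration; the DGP settings are those of the simulation experiment described above.

```python
import random


def ols(x, y):
    """Least squares intercept a and slope b, following (2.6) and (2.8)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx
    return y_bar - b * x_bar, b


rng = random.Random(2024)  # arbitrary seed, for reproducibility only
n, m = 20, 10_000          # sample size and number of simulation runs
alpha, beta, sigma = 10.0, 1.0, 5.0
x = [float(i) for i in range(1, n + 1)]

a_runs, b_runs = [], []
for _ in range(m):
    # New disturbances in every run; the regressor values stay fixed.
    y = [alpha + beta * xi + sigma * rng.gauss(0.0, 1.0) for xi in x]
    a_j, b_j = ols(x, y)
    a_runs.append(a_j)
    b_runs.append(b_j)

mean_a = sum(a_runs) / m
mean_b = sum(b_runs) / m
mse_a = sum((a_j - alpha) ** 2 for a_j in a_runs) / m
mse_b = sum((b_j - beta) ** 2 for b_j in b_runs) / m

print(f"mean a = {mean_a:.3f}, mean b = {mean_b:.3f}")
print(f"MSE a  = {mse_a:.3f}, MSE b  = {mse_b:.4f}")
```

The means should be close to the DGP values of 10 and 1, and the mean squared error of the slope should be much smaller than that of the intercept.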


Example  2.4:  Simulated  Regression  Data 

We will consider outcomes of the above simulation experiment. The data
are generated by the equation (2.16) with $n = 20$, $x_i = i$ for $i = 1, \ldots, 20$,
$\alpha = 10$, $\beta = 1$, and with $\varepsilon_1, \ldots, \varepsilon_{20}$ a random sample from a normal
distribution with mean zero and variance $\sigma^2 = 25$. The results of two simulation
runs are shown in Exhibit 2.7. As the two series of disturbance terms are
different in the two simulation runs (see (b) and (c)), the values of the
dependent variable are also different. This is also clear from the two scatter
diagrams in Exhibit 2.7 (d) and (e). As a result, the obtained regression line
$a + bx$ is different in the two simulation runs.

This simulation is repeated $m = 10{,}000$ times. Histograms of the resulting
estimates a and b are in Exhibit 2.8 (a) and (b). The means of the outcomes
are close to the values $\alpha = 10$ and $\beta = 1$ of the DGP. We see that the variation

Exhibit 2.7 Simulated Regression Data (Example 2.4)

Data generated by $y = 10 + x + \varepsilon$ with $\varepsilon \sim N(0, 25)$; shown are two simulations of sample
size 20 (a). The table in (a) has the columns X, YSYS = 10 + X, EPS1, Y1 = YSYS + EPS1, EPS2,
and Y2 = YSYS + EPS2.

Exhibit 2.7 (Contd.)

Graphs corresponding to the two simulations in (a), with series of disturbances (EPS1 and
EPS2 ((b)-(c))), and scatter diagrams (of $y_1$ against $x$ and of $y_2$ against $x$) and fitted regressions
(Y1FIT and Y2FIT ((d)-(e))).

Exhibit 2.8 Simulated Regression Data (Example 2.4)

Histograms of least squares estimates (a in (a) and b in (b)) in 10,000 simulations (the DGP has
$\alpha = 10$ and $\beta = 1$), and mean squared error of these estimates (c).



of the outcomes in b (measured by the standard deviation) is much smaller
than that of a, and the same holds true for the MSE (see (c)). Intuitively
speaking,  the  outcomes  of  the  slope  estimates  b  differ  significantly  from  zero. 
This  will  be  made  more  precise  in  Section  2.3.1,  where  we  discuss  the 
situation  of  a  single  data  set,  which  is  the  usual  situation  in  practice. 


2.2.2  Examples  of  regression  models 

Notation:  we  do  not  know  Greek  but  we  can  compute  Latin 

One of the virtues of the computer experiment in the foregoing section is that
it helps to explain the usual notation and terminology. We follow the
convention to denote the parameters of the DGP by Greek letters ($\alpha$, $\beta$, $\sigma^2$) and
the estimates by Latin letters ($a$, $b$, $s^2$). When we analyse empirical data we
do not know ‘true’ values of $\alpha$ and $\beta$, but we can compute estimates a and b
from the observed data.

Example  2.5:  Stock  Market  Returns  (continued) 

A well-known model of financial economics, the capital asset pricing model
(CAPM), relates the excess returns $x_i$ (on the market) and $y_i$ (of an individual
asset or a portfolio of assets in a sector) by the model (2.16). So the CAPM
assumes that the data in Example 2.1 in Section 2.1.1 are generated
by the linear model $y_i = \alpha + \beta x_i + \varepsilon_i$ for certain (unknown) values of $\alpha$ and
$\beta$. The disturbance terms $\varepsilon_i$ are needed because the linear dependence
between the returns is only an approximation, as is clear from Exhibit 2.1. It is
one of the tasks of the econometrician to estimate $\alpha$ and $\beta$ as well as possible.
The formulas for a and b in (2.6) and (2.8) can be used for this purpose. The
residuals $e_i$ in (2.2) can be seen as estimates of the disturbances $\varepsilon_i$ in (2.16).
The further analysis of this data set is left as an exercise (see Exercises 2.11,
2.12, and 2.15).

Example  2.6:  Bank  Wages  (continued) 

We consider again Example 2.2 and assume that the model (2.16) applies for
the data on logarithmic salary and education. Let S be the salary; then the
model states that $y_i = \log(S_i) = \alpha + \beta x_i + \varepsilon_i$. Here $\beta$ can be interpreted as the
relative increase in salary due to one year of additional education, which
is given by $\frac{dS/dx}{S} = \frac{d \log(S)}{dx}$. In the model (2.16) this derivative is assumed to
be constant. A careful inspection of Exhibit 2.2 (c) may cast some doubt on this
assumption, but for the time being we will accept it as a working hypothesis.
Again, the disturbance terms $\varepsilon_i$ are needed because the linear dependence
between the $y_i$ and $x_i$ is only an approximation. For every individual there
will be many factors, apart from education, that affect the salary. In the next
chapter we will introduce some of these factors explicitly. This data set is
further analysed in Example 2.9 (p. 102) and in Example 2.11 (p. 107).
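As a small worked illustration of this interpretation, consider the effect of one additional year of education while holding the disturbance fixed. This is only a sketch; the value 0.096 is the slope estimate reported in Section 2.1.4 and is used here in place of the unknown slope parameter.

```latex
\frac{S(x+1)}{S(x)}
  = \frac{\exp(\alpha + \beta (x+1) + \varepsilon)}{\exp(\alpha + \beta x + \varepsilon)}
  = e^{\beta} \approx 1 + \beta \quad (\beta \text{ small}),
\qquad
e^{0.096} - 1 \approx 0.101 .
```

So with this estimate one extra year of education goes together with a salary that is roughly 10 per cent higher, close to the value of the slope itself.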

Example  2.7:  Coffee  Sales  (continued) 

The data on prices and quantities sold of Example 2.3 show that demand
increases if the price is decreased. If this effect is supposed to be proportional
to the price decrease, then the demand curve can be described by the model
(2.16) with $y_i$ for the quantity sold and with $x_i$ for the price. The scatter
diagram in Exhibit 2.3 clearly shows that, for fixed prices, the sales still
fluctuate owing to unobserved causes. The variations in sales that are not
related to price variations are summarized by the disturbance terms $\varepsilon_i$ in
(2.16). The analysis of this data set is left as an exercise (see Exercise 2.10).


2.2.3  Seven  assumptions 

Uses  Sections  1.2.1,  1.2.3,  1.2.4. 


The  purpose  of  assumptions:  simpler  analysis 

Data generating experiments as described in Section 2.2.1 are often
performed in applied econometrics, in particular in complicated cases where
little  is  known  about  the  accuracy  of  the  estimation  procedures  used.  In  the 
case  of  the  linear  model  and  the  method  of  least  squares,  however,  we  can 
obtain  accuracy  measures  by  means  of  analytical  methods.  For  this  purpose, 
we  introduce  the  following  assumptions  on  the  data  generating  process. 

Assumption  on  the  regressors 

•  Assumption 1: fixed regressors. The n observations on the explanatory
variable $x_1, \ldots, x_n$ are fixed numbers. They satisfy $\sum (x_i - \bar{x})^2 > 0$.

This means that the values $x_i$ of the explanatory variable are assumed to be
non-random.  This  describes  the  situation  of  controlled  experiments.  For 
instance,  the  price  reductions  in  Example  2.3  were  performed  in 
a  controlled  marketing  experiment.  However,  in  economics  the  possibilities 
for  experiments  are  often  quite  limited.  For  example,  the  data  on  salaries  and 
education  in  Example  2.2  are  obtained  from  a  sample  of  individuals.  Here 
the  x  variable,  education,  is  not  determined  by  a  controlled  experiment.  It  is 
influenced  by  many  factors  —  for  instance,  the  different  opportunities  and 
situational  characteristics  of  the  individuals  —  and  these  factors  are  not 
observed in this sample.

Assumptions on the disturbances

•  Assumption 2: random disturbances, zero mean. The n disturbances
$\varepsilon_1, \ldots, \varepsilon_n$ are random variables with zero mean, $E[\varepsilon_i] = 0$ $(i = 1, \ldots, n)$.

•  Assumption 3: homoskedasticity. The variances of the n disturbances
$\varepsilon_1, \ldots, \varepsilon_n$ exist and are all equal, $E[\varepsilon_i^2] = \sigma^2$ $(i = 1, \ldots, n)$.

•  Assumption 4: no correlation. All pairs of disturbances $(\varepsilon_i, \varepsilon_j)$ are
uncorrelated, $E[\varepsilon_i \varepsilon_j] = 0$ $(i, j = 1, \ldots, n,\ i \neq j)$.

Assumptions 2-4 concern properties of the disturbance terms. Note that
they say nothing about the shape of the distribution, except that extreme
distributions (such as the Cauchy distribution) are excluded because it is
assumed that the means and variances exist. When the variances are equal
the disturbances are called homoskedastic, and when the variances differ
they are called heteroskedastic. Assumption 4 is also called the absence of
serial correlation across the observations.

Assumptions  on  model  and  model  parameters 

•  Assumption 5: constant parameters. The parameters $\alpha$, $\beta$, and $\sigma$ are fixed
unknown numbers with $\sigma > 0$.

This  means  that,  although  the  parameters  of  the  DGP  are  unknown,  we 
assume  that  all  the  n  observations  are  generated  with  the  same  values  of 
the  parameters. 

•  Assumption 6: linear model. The data on $y_1, \ldots, y_n$ have been generated by

$$y_i = \alpha + \beta x_i + \varepsilon_i \qquad (i = 1, \ldots, n). \tag{2.17}$$

The model is called linear because it postulates that $y_i$ depends in a linear way
on the parameters $\alpha$ and $\beta$. Together with Assumptions 1-4, it follows that

$$E[y_i] = \alpha + \beta x_i, \qquad var(y_i) = \sigma^2, \qquad cov(y_i, y_j) = 0 \quad (i \neq j).$$

So the observed values of the dependent variable are uncorrelated and have
the same variance. However, the mean value of $y_i$ varies across the
observations and depends on $x_i$.

Assumption  on  the  probability  distribution 

•  Assumption 7: normality. The disturbances $\varepsilon_1, \ldots, \varepsilon_n$ are jointly normally
distributed.

Together with Assumptions 2-4, this assumption specifies a precise
distribution for the disturbance terms and it implies that the disturbances are
mutually independent.

Interpretation of the simple regression model

Under Assumptions 1-7, the values $y_i$ are normally and independently
distributed with varying means $\alpha + \beta x_i$ and constant variance $\sigma^2$. This can be
written as

$$y_i \sim NID(\alpha + \beta x_i, \sigma^2) \qquad (i = 1, \ldots, n).$$

If $\beta = 0$, this reduces to the case of random samples from a fixed population
described in Chapter 1. The essential characteristic of the current model is
that variations in $y_i$ are not seen purely as the effect of randomness, but partly
as the effect of variations in the explanatory variable $x_i$.

Several  of  the  results  to  be  given  below  are  proved  under  Assumptions  1-6, 
but  sometimes  we  also  need  Assumption  7.  This  is  the  case,  for  instance,  in 
Section  2.3.1,  when  we  test  whether  the  estimated  slope  parameter  b  is 
significantly  different  from  zero. 

Exercises: E: 2.10a.


2.2.4  Statistical  properties 

Uses Sections 1.2.1, 1.3.2; Appendix A.1.


Derivation:  Some  helpful  notation  and  results 

Using Assumptions 1-6 we now derive some statistical properties of the least
squares estimators a and b as defined in (2.6) and (2.8). For this purpose, it
is convenient to express the random part of a and b as explicit functions
of the random variables $\varepsilon_i$, as the properties of these disturbances are given
by Assumptions 2-4. To express b in (2.8) in terms of $\varepsilon_i$, first note that

$$\sum (x_i - \bar{x})\bar{y} = \bar{y} \sum (x_i - \bar{x}) = 0, \qquad \sum (x_i - \bar{x})\bar{x} = \bar{x} \sum (x_i - \bar{x}) = 0. \tag{2.18}$$

Using this result, (2.8) can be written as

$$b = \frac{\sum (x_i - \bar{x}) y_i}{\sum (x_i - \bar{x}) x_i}.$$

Because of Assumption 6 we may substitute (2.17) for $y_i$, and by using
$\sum (x_i - \bar{x})\alpha = 0$ it follows that

$$b = \beta + \frac{\sum (x_i - \bar{x}) \varepsilon_i}{\sum (x_i - \bar{x}) x_i} = \beta + \sum c_i \varepsilon_i \tag{2.19}$$




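where the coefficients $c_i$ are non-random (because of Assumption 1) and given
by

$$c_i = \frac{x_i - \bar{x}}{\sum (x_i - \bar{x}) x_i} = \frac{x_i - \bar{x}}{\sum (x_i - \bar{x})^2}. \tag{2.20}$$

To express a in (2.6) in terms of $\varepsilon_i$, (2.17) implies that $\bar{y} = \alpha + \beta\bar{x} + \bar{\varepsilon}$ (where
$\bar{\varepsilon} = \frac{1}{n} \sum \varepsilon_i$) and (2.19) that $b\bar{x} = \beta\bar{x} + \bar{x} \sum c_i \varepsilon_i$. This shows that

$$a = \bar{y} - b\bar{x} = \alpha + \frac{1}{n} \sum \varepsilon_i - \bar{x} \sum c_i \varepsilon_i = \alpha + \sum d_i \varepsilon_i \tag{2.21}$$

where the coefficients $d_i$ are non-random and given by

$$d_i = \frac{1}{n} - \bar{x} c_i = \frac{1}{n} - \frac{\bar{x}(x_i - \bar{x})}{\sum (x_i - \bar{x})^2}. \tag{2.22}$$

From (2.20) and (2.22) we directly obtain the following properties:

$$\sum c_i = 0, \qquad \sum c_i x_i = 1, \qquad \sum c_i^2 = \frac{1}{\sum (x_i - \bar{x})^2}, \tag{2.23}$$

$$\sum d_i = 1, \qquad \sum d_i^2 = \frac{1}{n} + \frac{\bar{x}^2}{\sum (x_i - \bar{x})^2}. \tag{2.24}$$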

Least  squares  is  unbiased 

If we use the rules of the calculus of expectations (see Section 1.2), it follows
from (2.19) that

$$E[b] = E\left[\beta + \sum c_i \varepsilon_i\right] = \beta + \sum c_i E[\varepsilon_i] = \beta \tag{2.25}$$

because $\beta$ is non-random (Assumption 5), the $c_i$ are non-random
(Assumption 1), and $E[\varepsilon_i] = 0$ (Assumption 2). Summarizing, under
Assumptions 1, 2, 5, and 6 the estimator b has expected value $\beta$ and hence b is an
unbiased estimator of $\beta$. Under the same assumptions we get from (2.21)

$$E[a] = E\left[\alpha + \sum d_i \varepsilon_i\right] = \alpha + \sum d_i E[\varepsilon_i] = \alpha \tag{2.26}$$

so that a is also an unbiased estimator of $\alpha$. So the least squares estimates will,
on average, be equal to the correct parameter values.

The variance of least squares estimators

Although  the  property  of  being  unbiased  is  nice,  it  tells  us  only  that 
the  estimators  a  and  b  will  on  average  be  equal  to  a  and  /?.  However, 
in  practice  we  often  have  only  a  single  data  set  at  our  disposal.  Then  it 
is  important  that  the  deviations  (b  -  ft)2  and  (a  —  a)2  are  expected  to 
be  small.  We  measure  the  accuracy  by  the  mean  squared  errors  E[(b  —  p)2] 
and  £[(<7  —  a)2].  As  the  estimators  are  unbiased,  these  MSEs  are  equal  to  the 
variances  var (b)  and  var (a)  respectively.  It  follows  from  (2.19)  that 


var (b)  =  EE  c,CjE[e.tSj], 
and  Assumptions  3  and  4  and  (2.23)  give 


var 


(&)  =  S 


2  2 
C:G  — 


E  (*»  -  x)2 

The  variance  of  a  follows  from  (2.21)  and  (2.24)  with  result 


1 


var  (a)  =  d2a 2  =  a2  -  + 

Z—/  1  yt 


X 2 


E  (xi  -  x 


(2.27) 


(2.28) 
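
Unbiasedness and the variance formulas (2.27) and (2.28) can be illustrated by a
small Monte Carlo experiment. The sketch below is only an illustration under
assumed settings: the data generating process $y_i = 10 + x_i + \varepsilon_i$
with $x_i = i$, $i = 1, \ldots, 20$, and normal disturbances with variance 25,
the seed, and the helper name `ols` are all choices made for this example.

```python
import random
import statistics


def ols(x, y):
    """Return the least squares intercept a and slope b of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * yi for xi, yi in zip(x, y)) / sxx
    a = y_bar - b * x_bar
    return a, b


# Assumed illustrative DGP: y_i = alpha + beta * x_i + eps_i, eps_i ~ N(0, sigma^2).
alpha, beta, sigma = 10.0, 1.0, 5.0
x = [float(i) for i in range(1, 21)]
n = len(x)
x_bar = sum(x) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)

rng = random.Random(12345)          # fixed seed so the run is reproducible
a_draws, b_draws = [], []
for _ in range(10000):
    y = [alpha + beta * xi + rng.gauss(0.0, sigma) for xi in x]
    a, b = ols(x, y)
    a_draws.append(a)
    b_draws.append(b)

var_b_theory = sigma ** 2 / sxx                           # formula (2.27)
var_a_theory = sigma ** 2 * (1.0 / n + x_bar ** 2 / sxx)  # formula (2.28)

print("mean of b:", statistics.mean(b_draws), "(beta =", beta, ")")
print("mean of a:", statistics.mean(a_draws), "(alpha =", alpha, ")")
print("var of b:", statistics.variance(b_draws), "theory:", var_b_theory)
print("var of a:", statistics.variance(a_draws), "theory:", var_a_theory)
```

The simulated means and variances should lie close to, but not exactly on, the
theoretical values; the remaining gap is simulation noise and shrinks as the
number of runs grows.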


Graphical  illustration 

In Exhibit 2.9 we show four scatters generated with simulations of the type
described in Section 2.2.1, with different values for the error variance
$\sigma^2$ and for the systematic variance $\sum (x_i - \bar{x})^2$. The shapes
of the scatters give a good impression of the possibilities to determine the
regression line accurately. The best case is small error variance and large
systematic variance (shown in (b)), and the worst case is large error variance
and small systematic variance (shown in (c)).

Mean  and  variance  of  residuals 

In a similar way we can derive the mean and variance of the residuals $e_i$,
where $e_i = y_i - a - b x_i$. There holds $E[e_i] = 0$ and

$$\mathrm{var}(e_i) = \sigma^2 \left( 1 - \frac{1}{n} - \frac{(x_i - \bar{x})^2}{\sum_j (x_j - \bar{x})^2} \right) \tag{2.29}$$

(see Exercise 2.7). Note that this variance is smaller than the variance
$\sigma^2$ of the disturbances $\varepsilon_i$. The reason is that the method of
least squares tries to minimize the sum of squares of the residuals. Note also
that the difference is small if $n$ and $\sum (x_j - \bar{x})^2$ are large. If
both $n$ and $\sum (x_j - \bar{x})^2$ tend to infinity, then var($e_i$) tends to
$\sigma^2$.

Exhibit 2.9 Accuracy of least squares

Scatter diagrams of y against x (panels (a)-(d)); the standard deviation of x in
(b) and (d) is three times as large as in (a) and (c), and the standard deviation
of the error terms in (c) and (d) is three times as large as in (a) and (b).


Exercises:  T:  2.4,  2.5,  2.6,  2.9. 


2.2.5  Efficiency 

Uses Section 1.2.1.


Best  linear  unbiased  estimators  (BLUE) 

The least squares estimators $a$ and $b$ given in (2.6) and (2.8) are linear
expressions in $y_1, \ldots, y_n$. Such estimators are called linear estimators.
We have shown that they are unbiased. Now we will show that, under Assumptions
1-6, the estimators $a$ and $b$ are the best linear unbiased estimators (BLUE),
that is, they have the smallest possible variance in the class of all linear
unbiased estimators. Stated otherwise, the least squares estimators are
efficient in this respect. This is called the Gauss-Markov theorem. Note that
the assumption of normality is not needed for this result.


Proof  of  BLUE 

We will prove this result for $b$ (the result for $a$ follows from a more general
result treated in Section 3.1.4). Let $\hat{\beta}$ be an arbitrary linear
estimator of $\beta$. This means that it can be written as
$\hat{\beta} = \sum g_i y_i$ for certain fixed coefficients $g_1, \ldots, g_n$.
The least squares estimator can be written as $b = \sum c_i y_i$ with $c_i$ as
defined in (2.20). Now define $w_i = g_i - c_i$; then it follows that
$g_i = c_i + w_i$ and

$$\hat{\beta} = \sum (c_i + w_i) y_i = b + \sum w_i y_i. \tag{2.30}$$

Under Assumptions 1-6, the expected value of $\hat{\beta}$ is given by

$$E[\hat{\beta}] = E[b] + \sum w_i E[y_i] = \beta + \alpha \sum w_i + \beta \sum w_i x_i.$$

We require unbiasedness, irrespective of the values taken by $\alpha$ and
$\beta$. So the two conditions on $w_1, \ldots, w_n$ are that

$$\sum w_i = 0, \qquad \sum w_i x_i = 0. \tag{2.31}$$

It then follows from the assumption of the linear model (2.17) that

$$\sum w_i y_i = \alpha \sum w_i + \beta \sum w_i x_i + \sum w_i \varepsilon_i = \sum w_i \varepsilon_i,$$

and from (2.30) and (2.19) that

$$\hat{\beta} = b + \sum w_i \varepsilon_i = \beta + \sum (c_i + w_i) \varepsilon_i. \tag{2.32}$$

Because of Assumptions 3 and 4, the variance of $\hat{\beta}$ is equal to

$$\mathrm{var}(\hat{\beta}) = \sigma^2 \sum (c_i + w_i)^2. \tag{2.33}$$

Now $\sum (c_i + w_i)^2 = \sum c_i^2 + \sum w_i^2 + 2 \sum c_i w_i$ and the
expression (2.20) for $c_i$ together with the properties in (2.31) imply that
$\sum c_i w_i = 0$. Therefore (2.33) reduces to

$$\mathrm{var}(\hat{\beta}) = \mathrm{var}(b) + \sigma^2 \sum w_i^2.$$

Clearly, the variance is minimal if and only if $w_i = 0$ for all
$i = 1, \ldots, n$. This means that $\hat{\beta} = b$, and this proves the
Gauss-Markov theorem.
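
The decomposition of the variance in the proof can be made concrete with one
alternative linear unbiased estimator. The sketch below compares least squares
with the two-point slope estimator $(y_n - y_1)/(x_n - x_1)$; the x-values, the
error variance, and the variable names are assumptions chosen for illustration
only.

```python
# Exact variances of two linear unbiased slope estimators under Assumptions 1-6.
sigma2 = 25.0                                 # assumed error variance
x = [float(i) for i in range(1, 21)]          # assumed fixed x-values
n = len(x)
x_bar = sum(x) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)

c = [(xi - x_bar) / sxx for xi in x]          # least squares weights (2.20)

# Weights g_i of the alternative estimator (y_n - y_1) / (x_n - x_1).
g = [0.0] * n
g[0] = -1.0 / (x[-1] - x[0])
g[-1] = 1.0 / (x[-1] - x[0])

w = [gi - ci for gi, ci in zip(g, c)]         # w_i = g_i - c_i

# The unbiasedness conditions (2.31) should hold for the alternative estimator.
sum_w = sum(w)
sum_wx = sum(wi * xi for wi, xi in zip(w, x))

var_ols = sigma2 * sum(ci ** 2 for ci in c)   # var(b), see (2.27)
var_alt = sigma2 * sum(gi ** 2 for gi in g)   # variance of the alternative
extra = sigma2 * sum(wi ** 2 for wi in w)     # sigma^2 * sum of w_i^2

print("conditions (2.31):", sum_w, sum_wx)
print("var(b):", var_ols, "var(alternative):", var_alt)
print("var(b) + sigma^2 * sum w_i^2:", var_ols + extra)
```

In this setting the alternative estimator is unbiased but has a clearly larger
variance, and the excess equals $\sigma^2 \sum w_i^2$ up to rounding.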


Exercises: T: 2.3.


2.3 Significance tests


2.3.1 The t-test

Uses Sections 1.2.3, 1.4.1-1.4.3.

The  significance  of  an  estimate 

The regression model aims to explain the variation in the dependent variable
$y$ in terms of variations in the explanatory variable $x$. This makes sense only
if $y$ is related to $x$, that is, if $\beta \neq 0$ in the model (2.17). In
general, the least squares estimator $b$ will be different from zero, even if
$\beta = 0$. We want to apply a test for the null hypothesis $\beta = 0$ against
the alternative that $\beta \neq 0$. The null hypothesis will be rejected if $b$
differs significantly from zero. Now it is crucial to realize that, under
Assumptions 1-6, the obtained value of $b$ is the outcome of a random variable.
So, to decide whether $b$ is significant or not, we have to take the uncertainty
of this random variable into account. For instance, if $b$ has standard
deviation 100, then an outcome $b = 10$ is not significantly different from
zero, and if $b$ has standard deviation 0.01 then an outcome $b = 0.1$ is
significantly different from zero. Therefore we scale the outcome of $b$ by its
standard deviation. Further, to apply the testing approach discussed in Section
1.4, the distribution of $b$ should be known.


Derivation  of  test  statistic 

To derive a test for the significance of the slope estimate $b$, we will assume
that the disturbances $\varepsilon_i$ are normally distributed. So we will make
use of Assumptions 1-7 of Section 2.2.3. Since $b - \beta$ is linear in the
disturbances (see (2.19)), it is normally distributed with mean zero and with
variance given by (2.27). So the standard deviation of $b$ is given by
$\sigma_b = \sigma / \sqrt{\sum (x_i - \bar{x})^2}$ and

$$\frac{b - \beta}{\sigma_b} \sim N(0, 1).$$

This expression cannot be used as a test statistic, since $\sigma$ is an unknown
parameter. As the residuals $e_i$ are estimates of the disturbances
$\varepsilon_i$, this suggests estimating the variance
$\sigma^2 = E[\varepsilon_i^2]$ by $\hat{\sigma}^2 = \frac{1}{n} \sum e_i^2$.
However, this estimator is biased. It is left as an exercise (see Exercise 2.7)
to show that an unbiased estimator is given by

$$s^2 = \frac{1}{n - 2} \sum e_i^2. \tag{2.34}$$

We also refer to Sections 3.1.5 and 3.3.1 below, where it is further proved
that $\sum e_i^2 / \sigma^2$ follows the $\chi^2(n - 2)$ distribution and that
$s^2$ and $b$ are independent.


Standard  error  and  t-value 

It follows from the above results and by the definition of the t-distribution
that

$$t_b = \frac{b - \beta}{s_b} = \frac{b - \beta}{s / \sqrt{\sum (x_i - \bar{x})^2}} = \frac{(b - \beta)/\sigma_b}{\sqrt{\dfrac{\sum e_i^2 / \sigma^2}{n - 2}}} \sim t(n - 2), \tag{2.35}$$

where

$$s_b = \frac{s}{\sqrt{\sum (x_i - \bar{x})^2}}. \tag{2.36}$$

That is, $t_b$ follows the Student t-distribution with $n - 2$ degrees of
freedom. For $\beta = 0$, $t_b$ is called the t-value of $b$. Further, $s_b$ is
called the standard error of $b$, and $s$, the square root of (2.34), is called
the standard error of the regression. The null hypothesis $H_0: \beta = 0$ is
rejected against the alternative $H_1: \beta \neq 0$ if $b$ is too far from
zero, that is, if $|t_b| > c$, or equivalently, if $|b| > c s_b$. Then $b$ is
called significant, that is, it differs significantly from zero.


A  practical  rule  of  thumb  for  significance 

For a given level of significance, the critical value $c$ is obtained from the
$t(n - 2)$ distribution. For a 5 per cent significance level, the critical value
for $n = 30$ is $c = 2.05$, for $n = 60$ it is 2.00, and for $n \to \infty$ the
critical value converges to 1.96. As a rule of thumb (for the popular 5 per cent
significance level), one often uses $c = 2$ as an approximation. In this case the
estimate $b$ is significant if $|b| > 2 s_b$, that is, if the outcome is at
least twice as large as the uncertainty in this outcome as measured by the
standard deviation. That is, an estimated coefficient is significant if its
t-value is (in absolute value) larger than 2.
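
The steps from the residuals to the t-value can be sketched in a few lines of
Python. This is a minimal illustration, not a replacement for an econometrics
package: the function name `slope_t_test`, the default critical value, and the
five data points are assumptions made for this example.

```python
import math


def slope_t_test(x, y, c=2.0):
    """Slope estimate, standard error, and t-value for H0: beta = 0.

    c is the critical value. The default c = 2 is only the rule of thumb
    for the 5 per cent level; for small n the exact t(n - 2) value is larger.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    a = y_bar - b * x_bar
    residuals = [yi - a - b * xi for xi, yi in zip(x, y)]
    s2 = sum(e ** 2 for e in residuals) / (n - 2)   # unbiased variance estimate (2.34)
    s_b = math.sqrt(s2 / sxx)                        # standard error of b (2.36)
    t_b = b / s_b                                    # t-value: (2.35) with beta = 0
    return {"a": a, "b": b, "s2": s2, "s_b": s_b, "t_b": t_b,
            "significant": abs(t_b) > c}


# Made-up data, used only to show the mechanics.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
result = slope_t_test(x, y)
print(result)
```

With only five observations the relevant distribution is $t(3)$, whose 5 per
cent critical value is well above 2, so in such small samples the exact critical
value should be passed as `c` instead of relying on the rule of thumb.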


Interval  estimates 

The foregoing results can also be used to construct interval estimates of
$\beta$. Let $c$ be the critical value of the t-test of size $\alpha$, so that
$P[|t_b| > c] = \alpha$ where $t_b$ is defined as in (2.35). Then
$P[|t_b| < c] = 1 - \alpha$, and a $(1 - \alpha)$ interval estimate of $\beta$
is given by all values for which $-c < t_b < c$, that is,

$$b - c s_b < \beta < b + c s_b. \tag{2.37}$$

Exercises: T: 2.1c, d, 2.7; E: 2.10d-f, 2.12b, c, 2.13, 2.14a-c.


2.3.2 Examples


Example  2.8:  Simulated  Regression  Data  (continued) 

First we consider the situation of simulated data with a known DGP. For
this purpose we consider again the 10,000 simulated data sets of Example
2.4, with data generating process $y_i = 10 + x_i + \varepsilon_i$ where
$x_i = i$ and $\varepsilon_i$ are NID(0, 25), $i = 1, \ldots, 20$. So this DGP
has slope parameter $\beta = 1$. The histograms and some summary statistics of
the resulting 10,000 values of $b$, $s^2$, and $t_b$ are given in Exhibit 2.10
(a-d). For comparison, (d) also contains some properties of the corresponding
theoretical distributions.


Panels: (a) histogram of B, (b) histogram of TSTAT_B, (c) histogram of S2,
(e) scatter diagram of S2 against B.

(d)                         B         S2
    Theor. Expect.          1         25
    Theor. Std. Dev.        0.1939    8.3333
    Theor. Corr. (B, S2)    0
    Sample Corr. (B, S2)    -0.0147

Exhibit 2.10 Simulated Regression Data (Example 2.8)

Histograms of least squares estimates of slope ($b$, denoted by B in (a)),
t-value ($t_b$, denoted by TSTAT_B in (b)), and variance ($s^2$, denoted by S2 in
(c)) resulting from 10,000 simulations of the data generating process in Example
2.4 (with slope $\beta = 1$ and variance $\sigma^2 = 25$). Theoretical means and
standard deviations of $b$ and $s^2$ and theoretical and sample correlations
between $b$ and $s^2$ (d) and scatter diagram of $s^2$ against $b$ (e).



The histogram of $b$ is in accordance with the normal distribution
of this least squares estimator, and the histogram of $s^2$ is in accordance
with the (scaled) $\chi^2$ distribution. The t-statistic for the null hypothesis
$H_0: \beta = 0$ does not follow the t-distribution, as this hypothesis is not
correct ($\beta = 1$). In the great majority of cases the null hypothesis is
rejected; in only nineteen cases is it not rejected (using the 5 per cent
critical value $c = 2.1$ for the $t(18)$ distribution). This indicates a high
power (of around 99.8 per cent) of the t-test in this example. The scatter
diagram of $s^2$ against $b$ (shown in (e)) illustrates the independence of
these two random variables; their sample correlation over the 10,000 simulation
runs is less than 1.5 per cent in absolute value.

This simulation illustrates the distribution properties of $b$, $s^2$, and
$t_b$. In this example, the t-test is very successful in detecting a significant
effect of the variable $x$ on the variable $y$.
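
A simulation of this kind can be sketched as follows. The code is an
illustration under the stated DGP, not the program used for the exhibit: the
seed and variable names are assumptions, so the number of non-rejections will
depend on the random draws and need not equal nineteen.

```python
import math
import random
import statistics

rng = random.Random(2024)                     # assumed seed, for reproducibility
x = [float(i) for i in range(1, 21)]          # x_i = i, i = 1, ..., 20
n = len(x)
x_bar = sum(x) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)

s2_draws, t_draws = [], []
for _ in range(10000):
    # DGP: y_i = 10 + x_i + eps_i with eps_i ~ N(0, 25), i.e. standard deviation 5.
    y = [10.0 + xi + rng.gauss(0.0, 5.0) for xi in x]
    y_bar = sum(y) / n
    b = sum((xi - x_bar) * yi for xi, yi in zip(x, y)) / sxx
    a = y_bar - b * x_bar
    s2 = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    s2_draws.append(s2)
    t_draws.append(b / math.sqrt(s2 / sxx))   # t-value for H0: beta = 0

# Count how often H0: beta = 0 is not rejected with critical value c = 2.1.
not_rejected = sum(1 for t in t_draws if abs(t) <= 2.1)
rejection_rate = 1.0 - not_rejected / len(t_draws)

print("mean of s2:", statistics.mean(s2_draws))
print("not rejected:", not_rejected, "rejection rate:", rejection_rate)
```

The average of the simulated $s^2$ values should be close to 25 and the
rejection rate close to one, in line with the high power reported above.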


Example 2.9: Bank Wages (continued)

Next we consider a real data set, namely, on bank wages. For the salary
and education data of bank employees discussed before in Example 2.2,
the sample moments were given in Section 2.1.4. Using the results in
Exhibit 2.6, the variance of the disturbance terms is estimated by
$s^2 = SSR/(n - 2) = 38.424/472 = 0.0814$ and the standard error of the
regression is $s = 0.285$. So the standard error of $b$ is
$s_b = 0.285/\sqrt{3937} = 0.00455$ and the t-value is
$t_b = 0.096/0.00455 = 21.1$. To perform a 5 per cent significance test of
$H_0: \beta = 0$ against $H_1: \beta \neq 0$, the (two-sided) critical value of
the $t(472)$ distribution is given by 1.96, so that the null hypothesis is
clearly rejected. This means that education has a very significant effect on
wages. These outcomes are also given in Exhibit 2.11, together with the P-value
for the test of $H_0: \beta = 0$ against $H_1: \beta \neq 0$. The P-value is
reported as 0.0000, which actually means that it is smaller than 0.00005. Note
that this P-value is not exactly zero, as even for $\beta = 0$ the probability
of getting t-values larger than 21.1 is non-zero. However, the null hypothesis
that $\beta = 0$ is rejected for all sizes $\alpha > 0.00005$. For such a low
P-value as in this example we will always reject the null hypothesis.
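
The arithmetic of this example can be retraced step by step. The sketch below
uses only the rounded numbers quoted in the text as inputs; the variable names
are assumptions, and the interval at the end simply applies (2.37) with the
critical value 1.96.

```python
import math

# Inputs as quoted (rounded) in Example 2.9.
ssr = 38.424      # sum of squared residuals
n = 474           # number of bank employees
sxx = 3937.0      # sum of squared deviations of education from its mean
b = 0.096         # estimated slope

s2 = ssr / (n - 2)            # estimated disturbance variance, see (2.34)
s = math.sqrt(s2)             # standard error of the regression
s_b = s / math.sqrt(sxx)      # standard error of b, see (2.36)
t_b = b / s_b                 # t-value of b

# 95 per cent interval estimate for beta from (2.37), using c = 1.96.
c = 1.96
interval = (b - c * s_b, b + c * s_b)

print(round(s2, 4), round(s, 3), round(s_b, 5), round(t_b, 1))
print(tuple(round(v, 4) for v in interval))
```

Because the inputs are rounded, the last digits can differ slightly from the
values in Exhibit 2.11; the interval lies well away from zero, which agrees with
the rejection of the null hypothesis.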

The  regression  results  are  often  presented  in  the  following  way,  where  the 
numbers  in  parentheses  denote  the  t-values  and  e  denotes  the  residuals  of  the 
regression  (the  equation  without  e  is  not  valid,  as  the  data  do  not  lie  exactly 
on  the  estimated  line). 


$$y = \underset{(144)}{9.06} + \underset{(21.1)}{0.096}\,x + e.$$

Dependent Variable: LOGSALARY
Method: Least Squares
Sample: 1 474

Variable    Coefficient    Std. Error    t-Statistic    Prob.
C           9.062102       0.062738      144.4446       0.0000
EDUC        0.095963       0.004548      21.10214       0.0000

R-squared             0.485447    Mean dependent var    10.35679
Sum squared resid     38.42407    S.D. dependent var    0.397334
S.E. of regression    0.285319

Exhibit 2.11 Bank Wages (Example 2.9)

Results of regression of salary (in logarithms) on a constant (denoted by C) and
education, based on data of 474 bank employees.


2.3.3  Use  under  less  strict  conditions 

Weaker  assumptions  on  the  DGP 

The rather strict conditions of Assumptions 1-7 (in particular, fixed values of
the x variable and normally distributed disturbances) were introduced in
order to simplify the proofs. Fortunately, the same results hold approximately
true under more general conditions. In Exhibit 2.12 we present the
results of a number of simulation experiments where the conditions on both
the explanatory variable and the disturbances have been varied. The exhibit
reports some quantiles. If a random variable $y$ has a strictly monotone
cumulative distribution function $F(v) = P[y \le v]$, then the quantile $q(p)$
is defined by the condition that $F(q(p)) = p$. In other words, the quantile
function is the inverse of the cumulative distribution function. The exhibit
shows quantiles for $p = 0.75$, $p = 0.90$, $p = 0.95$, and $p = 0.975$. The last
quantile corresponds to the critical value for a two-sided test with
significance level 5 per cent.
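
The definition of a quantile as the inverse of the cumulative distribution
function can be illustrated with the standard normal distribution, which is the
limit of the $t(n - 2)$ distribution for large $n$. The sketch below uses
`statistics.NormalDist` from the Python standard library; using the normal
rather than the t-distribution is a simplification made for this example.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)
levels = [0.75, 0.90, 0.95, 0.975]

# q(p) is defined by F(q(p)) = p, so inv_cdf inverts cdf.
quantiles = {p: std_normal.inv_cdf(p) for p in levels}

for p, q in quantiles.items():
    # Applying F to q(p) should return p again (up to rounding).
    print(p, round(q, 3), round(std_normal.cdf(q), 3))
```

The normal quantiles are slightly smaller than the corresponding $t(198)$
quantiles, as expected for a distribution with somewhat thinner tails.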


                                          Quantiles
Row  x       ε         Result         0.750   0.900   0.950   0.975
1    Fixed   Normal    Exact t(198)   0.676   1.286   1.653   1.972
2    Fixed   Normal    Simulated      0.675   1.290   1.651   1.984
3    Fixed   Logistic  Simulated      0.678   1.289   1.656   1.986
4    Normal  Normal    Simulated      0.677   1.285   1.653   1.980
5    Normal  Logistic  Simulated      0.679   1.287   1.650   1.982

Exhibit 2.12 Quantiles of distributions of t-statistics

Rows 1 and 2 correspond to the standard model that satisfies Assumptions 1-7, in
rows 3 and 5 the DGP does not satisfy Assumption 7 (normality), and in rows 4
and 5 the DGP does not satisfy Assumption 1 (fixed regressors).

Discussion of simulation results

Rows 1 and 2 of the table give the results for the classical linear model, where
the x values are fixed and the disturbances are independently, identically
normally distributed. The first row gives the exact results corresponding to
the $t(198)$ distribution. The second row gives the results from a simulation
experiment where 50,000 samples were drawn, each of $n = 200$ observations.
The remaining rows give the results of further simulation experiments
(each consisting of 50,000 simulation runs) under different conditions. In
row 3 the disturbances are drawn from a logistic distribution with density
function $f(x) = e^x/(1 + e^x)^2$ and with cumulative distribution function
$F(x) = 1/(1 + e^{-x})$. This density is bell-shaped but the tails are somewhat
fatter than those of the normal density. In rows 4 and 5 the values of the x
variable are no longer kept fixed along the different simulation runs, but
instead they are drawn from a normal distribution, independently of the
disturbances. To enhance the comparability of the results the same x values
were used in rows 4 and 5. Likewise, the same disturbances were used in
rows 2 and 4, and in rows 3 and 5.
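
An experiment in the spirit of row 3 (fixed regressors, logistic disturbances)
can be sketched as follows. This is a scaled-down illustration with assumed
settings: the x-values, the parameter values $\alpha = 0$ and $\beta = 1$, the
seed, and the use of 2,000 instead of 50,000 runs are all choices made here, so
the result will not reproduce the exhibit exactly.

```python
import math
import random


def logistic_draw(rng):
    """Draw from the logistic distribution by inverting F(x) = 1 / (1 + exp(-x))."""
    p = rng.random()
    while p == 0.0:               # random() lies in [0, 1); avoid log(0)
        p = rng.random()
    return math.log(p / (1.0 - p))


def simulate_t_quantile(runs=2000, n=200, p=0.975, seed=7):
    """Empirical p-quantile of t_b = (b - beta) / s_b with logistic disturbances."""
    rng = random.Random(seed)
    x = [i / 10.0 for i in range(1, n + 1)]   # assumed fixed regressor values
    x_bar = sum(x) / n
    dev = [xi - x_bar for xi in x]
    sxx = sum(d * d for d in dev)
    t_values = []
    for _ in range(runs):
        # Assumed parameters alpha = 0 and beta = 1, so y_i = x_i + eps_i.
        y = [xi + logistic_draw(rng) for xi in x]
        y_bar = sum(y) / n
        b = sum(d * yi for d, yi in zip(dev, y)) / sxx
        a = y_bar - b * x_bar
        ssr = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
        s_b = math.sqrt(ssr / (n - 2) / sxx)
        t_values.append((b - 1.0) / s_b)      # t-statistic at the true slope
    t_values.sort()
    return t_values[int(p * runs)]


q975 = simulate_t_quantile()
print("simulated 0.975 quantile:", round(q975, 3))
```

With so few runs the simulated quantile is noisy, but it should fall in the
neighbourhood of the $t(198)$ value of 1.972 rather than far away from it.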

Conclusion 

When we compare the quantiles, we see that the differences between the
rows are very small. This illustrates that we may apply the formulas derived
under the assumptions of the linear model also in cases where the assumptions
of fixed regressors or normal disturbances are not satisfied. Under the
assumptions of this simulation example this still gives reliable results.


2.4 Prediction


2.4.1  Point  predictions  and  prediction  intervals 

Point  prediction 

We  consider  the  use  of  an  estimated  regression  model  for  the  prediction  of  the 
outcome  of  the  dependent  variable  y  for  a  given  value  of  the  explanatory 
variable  x. 

The least squares residuals $e_i = y_i - a - b x_i$ correspond to the deviations
of $y_i$ from the fitted values $a + b x_i$, $i = 1, \ldots, n$. The regression
line $a + bx$ can be interpreted as the prediction of the y-value for a given
x-value, and $s^2$ indicates the average accuracy of these predictions. Now
assume that we want to predict the outcome $y_{n+1}$ for a given new value
$x_{n+1}$. An obvious prediction is given by $a + b x_{n+1}$. This is called a
point prediction.

Prediction  error  and  variance 

In order to say something about the accuracy of this prediction we need to
make assumptions about the mechanism generating the value of $y_{n+1}$. We
suppose that Assumptions 1-6 hold true for $i = 1, \ldots, n + 1$. If at a later
point of time we observe $y_{n+1}$, we can evaluate the quality of our prediction
by computing the prediction error

$$f = y_{n+1} - a - b x_{n+1}. \tag{2.38}$$

If $y_{n+1}$ is unknown, we can get an idea of the prediction accuracy by
deriving the mean and variance of the prediction error. Under Assumptions 1-6,
the mean is $E[f] = 0$, so that the prediction is unbiased, and the variance is
given by

$$\mathrm{var}(f) = \sigma^2 \left( 1 + \frac{1}{n} + \frac{(x_{n+1} - \bar{x})^2}{\sum (x_i - \bar{x})^2} \right). \tag{2.39}$$

Here the average $\bar{x}$ and the summation refer to the estimation sample
$i = 1, \ldots, n$. The proofs are left as an exercise (see Exercise 2.8). Note
that the variance of the prediction error is larger than the variance
$\sigma^2$ of the disturbances. The extra terms are due to the fact that $a$ and
$b$ are used rather than $\alpha$ and $\beta$. It is also seen that the variance
of the prediction error reaches its minimum for $x_{n+1} = \bar{x}$ and that the
prediction errors tend to be larger for values of $x_{n+1}$ that are further
away from $\bar{x}$. By using the expression (2.6) for $a$, (2.38) can be written
as $f = (y_{n+1} - \bar{y}) - b(x_{n+1} - \bar{x})$. So uncertainty about the
slope $b$ of the regression line leads to larger forecast uncertainty when
$x_{n+1}$ is further away from $\bar{x}$. This is illustrated in Exhibit 2.13.

Exhibit 2.13 Prediction error

Uncertainty in the slope of the regression line (indicated by the lower value
$b_L$ and the upper value $b_U$ of an interval estimate of the slope) results in
larger forecast uncertainty for values of the explanatory variable that are
further away from the sample mean (the forecast interval $f_2$ corresponding to
$x_2$ is larger than the interval $f_1$ corresponding to $x_1$).


Prediction  interval 

The above results can also be used to construct prediction intervals. If
Assumptions 1-7 hold true for $i = 1, \ldots, n + 1$, then the prediction error
$f$ is normally distributed and independent of $s^2$, based on the first $n$
observations and defined in (2.34). Let

$$s_f^2 = s^2 \left( 1 + \frac{1}{n} + \frac{(x_{n+1} - \bar{x})^2}{\sum (x_i - \bar{x})^2} \right);$$

then it follows that $f/s_f \sim t(n - 2)$. So a $(1 - \alpha)$ prediction
interval for $y_{n+1}$ is given by

$$(a + b x_{n+1} - c s_f, \; a + b x_{n+1} + c s_f)$$

where $c$ is such that $P[|t| > c] = \alpha$ when $t \sim t(n - 2)$.

Conditional prediction

In the foregoing results the value of $x_{n+1}$ should be known. Therefore this
is called conditional prediction, in contrast to unconditional prediction, where
the value of $x_{n+1}$ is unknown and should also be predicted. Since our model
does not contain a mechanism to predict $x_{n+1}$, this would require additional
assumptions on the way the x-values are generated.
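
For a known value of $x_{n+1}$, the point prediction and the prediction interval
can be computed as in the following sketch. The function name
`prediction_interval`, the data, and the way the critical value is supplied are
assumptions made for this illustration; the critical value of the $t(n - 2)$
distribution has to be looked up separately.

```python
import math


def prediction_interval(x, y, x_new, c):
    """Point prediction a + b*x_new and the interval point -/+ c*s_f.

    c is the two-sided critical value of the t(n - 2) distribution,
    supplied by the caller.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    a = y_bar - b * x_bar
    s2 = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    # Estimated standard deviation of the prediction error (s_f in the text).
    s_f = math.sqrt(s2 * (1.0 + 1.0 / n + (x_new - x_bar) ** 2 / sxx))
    point = a + b * x_new
    return point, (point - c * s_f, point + c * s_f)


# Made-up data; 3.18 is approximately the 5 per cent critical value of t(3).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
point_mid, interval_mid = prediction_interval(x, y, 3.0, 3.18)   # at the sample mean
point_out, interval_out = prediction_interval(x, y, 6.0, 3.18)   # outside the sample
print(point_mid, interval_mid)
print(point_out, interval_out)
```

The interval at the sample mean of $x$ is the narrowest one, and it widens as
the new x-value moves away from the mean, as (2.39) indicates.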

Exercises:  T:  2.8;  E:  2.14d,  e,  2.15. 


2.4.2  Examples 

Example  2.10:  Simulated  Regression  Data  (continued) 

Consider once more the 10,000 simulated data sets of Example 2.4. We
consider two situations, one where the new value of $x_{21} = 10$ is in the
middle of the sample of previous x-values (that range between 1 and 20) and
another where $x_{21} = 40$ lies outside this range. For both cases we generate
10,000 predictions $a + b x_{21}$, each prediction corresponding to the values of
$(a, b)$ obtained for one of the 10,000 simulated data sets. Further we also
generate in both cases 10,000 new values of
$y_{21} = 10 + x_{21} + \varepsilon_{21}$ by random drawings $\varepsilon_{21}$
of the N(0, 25) distribution.

Exhibit 2.14 shows histograms and summary statistics of the resulting two
sets of 10,000 predictions (in (a) and (c)) and of the prediction errors
$f_{21} = y_{21} - (a + b x_{21})$ (in (b), (d), and (e)). Clearly, for
$x_{21} = 10$ the predictions and forecast errors have a smaller standard
deviation than for $x_{21} = 40$, as would be expected because of (2.39).
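
The theoretical standard deviations of the forecast errors follow directly from
(2.39) with $\sigma^2 = 25$ and $x_i = i$, $i = 1, \ldots, 20$. A minimal sketch
(the function name `forecast_error_sd` is an assumption of this example):

```python
import math


def forecast_error_sd(x, x_new, sigma2):
    """Standard deviation of the prediction error according to (2.39)."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return math.sqrt(sigma2 * (1.0 + 1.0 / n + (x_new - x_bar) ** 2 / sxx))


x = [float(i) for i in range(1, 21)]          # x_i = i, i = 1, ..., 20
sd_10 = forecast_error_sd(x, 10.0, 25.0)      # new value inside the sample range
sd_40 = forecast_error_sd(x, 40.0, 25.0)      # new value far outside the range
print(round(sd_10, 7), round(sd_40, 7))
```

The two values agree with the theoretical standard deviations reported for F10
and F40 in Exhibit 2.14.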


Example 2.11: Bank Wages (continued)

We  consider  again  the  salary  and  education  data  of  bank  employees.  We  will 
discuss  (i)  the  splitting  of  the  sample  in  two  sub-samples,  (ii)  the  forecasts, 
and  (iii)  the  interpretation  of  the  forecast  results. 

(i)  Splitting  of  the  sample  in  two  sub-samples 

To illustrate the idea of prediction we split the data set into two parts. The
first part (used in estimation) consists of 424 individuals with sixteen years of
education or less; the second part (used in prediction) consists of the
remaining 50 individuals with seventeen years of education or more. In this
way we can investigate whether the effect of education on salary is the same
for lower and higher levels of education.

Panels: (a) and (c) histograms of forecasted values, (b) and (d) histograms of
forecast errors.

(e)                     F10          F40
    Theor. Expect.      0            0
    Theor. Std. Dev.    5.1243925    7.6789430

Exhibit 2.14 Simulated Regression Data (Example 2.10)

Forecasted values of $y$ ((a) and (c)) and forecast errors $f$ ((b) and (d)) in
10,000 simulations from the data generating process of Example 2.4 for two
values of $x$, that is, $x = 10$ ((a)-(b)) and $x = 40$ ((c)-(d)), together with
theoretical expected values and standard deviations of the forecast errors
(denoted by F10 for $x = 10$ and F40 for $x = 40$ (e)).


(ii)  Forecasts 

The  results  of  the  regression  over  the  first  group  of  individuals  are  shown  in 
Exhibit  2.15  (a).  The  estimated  intercept  is  a  =  9.39  and  the  estimated  slope 
is  b  =  0.0684.  With  this  model  the  salary  of  an  individual  in  the  second 
group  with  an  education  of  x  years  is  predicted  by  a  +  bx  =  9.39  +  0.0684x. 
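
The point predictions from the sub-sample regression can be compared with the
fitted values from the full-sample regression. The sketch below uses the
coefficients reported in Exhibits 2.11 and 2.15; the education levels 17, 19,
and 21 and the function name are illustrative assumptions.

```python
# Coefficients of the regressions of log salary on education, as reported.
a_sub, b_sub = 9.387947, 0.068414     # 424 employees with at most 16 years
a_full, b_full = 9.062102, 0.095963   # all 474 employees


def predict_log_salary(a, b, educ_years):
    """Point prediction a + b * x of the logarithm of salary."""
    return a + b * educ_years


for educ in (17, 19, 21):             # assumed example education levels
    sub = predict_log_salary(a_sub, b_sub, educ)
    full = predict_log_salary(a_full, b_full, educ)
    print(educ, round(sub, 3), round(full, 3), round(full - sub, 3))
```

For these education levels the sub-sample line gives lower predicted log
salaries than the full-sample line, and the gap widens with education because
the sub-sample slope is flatter.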

(iii)  Interpretation  of  forecast  results 

Exhibit 2.15 (d) shows that the actual salaries of these highly educated
persons are systematically higher than predicted. We mention the following
facts. The average squared prediction error (for the fifty highly educated
employees) is equal to $\sum_{i=425}^{474} f_i^2/50 = 0.268$. This is larger than
the average squared residual $\sum_{i=425}^{474} e_i^2/50 = 0.142$ if the
estimates $a = 9.06$ and $b = 0.0960$ are used that were obtained from a
regression over the full sample in Section 2.1.4 (see Exhibit 2.6). Moreover,
the average squared prediction error is also larger than what would be expected
on the basis of (2.39), which is based on Assumptions 1-7 for the DGP. If we
average this expression over the fifty values of education ($x$) in the second
group, with the estimated variance $s^2 = (0.262)^2 = 0.0688$ obtained from the
regression over the 424 individuals with sixteen years of education or less in
Exhibit 2.15, then this gives the value 0.139. As the actual squared prediction
errors are on average nearly twice as large (0.268 instead of 0.139), this may
cast some doubt on the working hypothesis that Assumptions 1-7 hold true for
the full data set of 474 persons. It seems that the returns on education are
larger for higher-educated employees than for lower-educated employees.
We will return to this question in Section 5.2.1.

(a)
Dependent Variable: LOGSALARY
Method: Least Squares
Sample: 1 424 (individuals with at most 16 years of education)

Variable    Coefficient    Std. Error    t-Statistic    Prob.
C           9.387947       0.068722      136.6081       0.0000
EDUC        0.068414       0.005233      13.07446       0.0000

R-squared             0.288294    Mean dependent var    10.27088
Sum squared resid     29.02805    S.D. dependent var    0.310519
S.E. of regression    0.262272

Panels: (b) and (c) scatter diagrams of LOGSALARY against EDUC; (d) LOGSALARY
against EDUC (steep regression line) and FORECAST against EDUC (flat regression
line).

Exhibit 2.15 Bank Wages (Example 2.11)

Result of regression of salary (in logarithms) on a constant and education for
424 bank employees with at most sixteen years of education (a) and three scatter
diagrams, one for all 474 employees (b), one for 424 employees with at most
sixteen years of education (c), and one for all 474 employees together with the
predicted values of employees with at least seventeen years of education ((d),
with predictions based on the regression in (a)).


Summary, further reading, and keywords


SUMMARY 

In  this  chapter  we  considered  the  simple  regression  model,  where  variations 
in the dependent variable are explained in terms of variations of the
explanatory variable. The method of least squares can be used to estimate the
parameters  of  this  model.  The  statistical  properties  of  these  estimators  were 
derived  under  a  number  of  assumptions  on  the  data  generating  process. 
Further  we  described  methods  to  construct  point  predictions  and  prediction 
intervals. 

The  ideas  presented  in  this  chapter  form  the  basis  for  many  other  types  of 
econometric  models.  In  Chapter  3  we  consider  models  with  more  than  one 
explanatory  variable,  and  later  chapters  contain  further  extensions  that  are 
often  needed  in  practice. 


FURTHER  READING 

Most  of  the  textbooks  on  statistics  mentioned  in  Section  1.5  contain  chapters  on 
regression.  Econometric  textbooks  go  beyond  the  simple  regression  model.  In  the 
following  chapters  we  make  intensive  use  of  matrix  algebra,  and  references  to 
textbooks  that  also  follow  this  approach  are  given  in  Chapter  3,  Further  Reading 
(p.  178).  We  now  mention  some  econometric  textbooks  that  do  not  use  matrix 
algebra. 

Gujarati, D. N. (2003). Basic Econometrics. Boston: McGraw-Hill.

Hill, R. C., Griffiths, W. E., and Judge, G. G. (2001). Undergraduate
Econometrics. New York: Wiley.

Kennedy, P. (1998). A Guide to Econometrics. Oxford: Blackwell.

Maddala, G. S. (2001). Introduction to Econometrics. London: Prentice Hall.

Pindyck, R. S., and Rubinfeld, D. L. (1998). Econometric Models and Economic
Forecasts. Boston: McGraw-Hill.

Thomas, R. L. (1997). Modern Econometrics. Harlow: Addison-Wesley.

Wooldridge, J. M. (2000). Introductory Econometrics. Australia: Thomson
Learning.


THEORY QUESTIONS

2.1 (Sections 2.1.2, 2.3.1)

Let two data sets $(x_i, y_i)$ and $(x_i^*, y_i^*)$ be related by
$x_i^* = c_1 + c_2 x_i$ and $y_i^* = c_3 + c_4 y_i$ for all
$i = 1, \ldots, n$. This means that the only differences
between the two data sets are the location and the
scale of measurement. Such data transformations
are often applied in economic studies. For example,
$y_i$ may be the total variable production costs in
dollars of a firm in month $i$ and $y_i^*$ the total production
costs in millions of dollars. Then $c_3$ are the total
fixed costs (in millions of dollars) and $c_4 = 10^{-6}$.

a. Derive the relation $y^* = \alpha^* + \beta^* x^*$ between $y^*$
and $x^*$ if $y$ and $x$ would satisfy the linear relation
$y = \alpha + \beta x$.

b. For arbitrary data $(x_i, y_i)$, $i = 1, \ldots, n$, derive
the relation between the least squares estimators
$(a, b)$ for the original data and $(a^*, b^*)$ for the
transformed data.

c. Which of the statistics $R^2$, $s^2$, $s_b$, and $t_b$ are
invariant with respect to this transformation?

d. Check the results in b and c with the
excess returns data of Example 2.1 on
stock market returns. Perform two regressions,
one with the original data
(in percentages) and the other with transformed
data with the actual excess returns, that is,
with $c_1 = c_3 = 0$ and $c_2 = c_4 = 0.01$.


2.2 (Section 2.1.3)

In the regression model the variable $y$ is regressed
on the variable $x$ with resulting regression line
$a + bx$. Reversing the role of the two variables, $x$
can be regressed on $y$ with resulting regression line
$c + dy$.

a. Derive formulas for the least squares estimates of
$c$ and $d$ obtained by regressing $x$ on $y$.

b. Show that $bd = R^2$, where $b$ is the conventional
least squares estimator and $d$ the slope estimator
in a.

c. Conclude that in general $d \neq 1/b$. Explain this
in terms of the criterion functions used to obtain
$b$ and $d$.

d. Finally, check the results in b and c by
considering again the excess returns
data of Example 2.1.


2.3 (Section 2.2.5)

Suppose that Assumptions 1-6 are satisfied. We consider
two slope estimators, $b_1 = (y_n - y_1)/(x_n - x_1)$
and $b_2 = \sum y_i / \sum x_i$, as alternatives for the least
squares estimator $b$.

a. Investigate whether $b_1$ and $b_2$ are unbiased estimators
of $\beta$.

b. Determine expressions for the variances of $b_1$
and $b_2$.

c. Show that $\mathrm{var}(b_1) > \mathrm{var}(b)$.

d. Show that there exist data $x_i$, $i = 1, \ldots, n$, so
that $\mathrm{var}(b_2) < \mathrm{var}(b)$. Is this not in contradiction
with the Gauss-Markov theorem?

2.4 (Section 2.2.4)

Let Assumption 6 be replaced by the assumption
that the data are generated by $y_i = \beta x_i + \varepsilon_i$, so that
$\alpha = 0$ is given. We wish to fit a line through the
origin by means of least squares, that is, by minimizing
$\sum (y_i - b x_i)^2$.

a. Adapt Assumptions 1 and 5 for this special case.

b. Prove that the value of $b$ that minimizes this sum
of squares is given by $b_* = \sum x_i y_i / \sum x_i^2$.

c. Find the mean and variance of this estimator.

d. Investigate whether the estimator $b_2$ of Exercise
2.3 is unbiased now, and show that $\mathrm{var}(b_2) > \mathrm{var}(b_*)$.

e. Let $R^2$ be defined by $R^2 = b_*^2 \sum x_i^2 / \sum y_i^2$. Show,
by means of a simulation example, that the
results in (2.11), (2.12), and (2.15) no longer
hold true.


2.5 (Section 2.2.4)

Sometimes we wish to assign different weights to the
observations. This is for instance the case if the
observations refer to countries and we want to give
larger weights to larger countries.

a. Find the value of $b$ that minimizes $\sum w_i e_i^2$, where
the weights $w_1, \ldots, w_n$ are given positive
numbers, for instance, the populations or the
areas of the countries. Without loss of generality
it may be assumed that the weights are scaled so
that $\sum w_i = 1$.

b.  Now  suppose  that  Assumptions  1-6  are  satisfied. 
Is  the  estimator  of  a  unbiased? 

2.6 (Section 2.2.4)

Suppose that Assumptions 1, 2, and 4-6 hold but
that the variances of the disturbances are given by
$E[\varepsilon_i^2] = \sigma^2 g_i$ $(i = 1, \ldots, n)$, where the $g_i$ are known
and given numbers. Assume that $b$ is computed
according to (2.8).

a. Is $b$ still unbiased under these assumptions?

b. Derive the variance of $b$ under these assumptions.

c. Verify that this result reduces to (2.27) if $g_i = 1$
for $i = 1, \ldots, n$.

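2.7 (Section 2.3.1)

In this exercise we prove that the least squares estimator
$s^2$ is unbiased. Prove the following results
under Assumptions 1-6.

a. $e_i = (y_i - \bar{y}) - b(x_i - \bar{x}) = -(x_i - \bar{x})(b - \beta) + \varepsilon_i - \bar{\varepsilon}$.

b. $E[(\varepsilon_i - \bar{\varepsilon})^2] = \sigma^2 \left(1 - \frac{1}{n}\right)$ and
$E[(b - \beta)(\varepsilon_i - \bar{\varepsilon})] = \sigma^2 (x_i - \bar{x}) / \sum_j (x_j - \bar{x})^2$.

c. $E[e_i] = 0$ and $\mathrm{var}(e_i) = \sigma^2 \left(1 - \frac{1}{n} - \frac{(x_i - \bar{x})^2}{\sum_j (x_j - \bar{x})^2}\right)$.

d. $E[s^2] = \sigma^2$, where $s^2$ is defined in (2.34).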

2.8 (Section 2.4.1)

Under the assumptions stated in Section 2.4.1,
prove the following results for the prediction error
$f$ in (2.38). The notation $\bar{x}$, $\bar{y}$, and $\bar{\varepsilon}$ is used to
denote sample averages over the estimation sample
$i = 1, \ldots, n$.

a. $f = (y_{n+1} - \bar{y}) - b(x_{n+1} - \bar{x}) = -(x_{n+1} - \bar{x})(b - \beta) + \varepsilon_{n+1} - \bar{\varepsilon}$,
and $E[f] = 0$.

b. $E[(\varepsilon_{n+1} - \bar{\varepsilon})^2] = \sigma^2 \left(1 + \frac{1}{n}\right)$. Explain the difference
with the first result in Exercise 2.7b.

c. Prove the result (2.39).

d. Comment on the difference between this result
and the one in Exercise 2.7c; in particular, explain
why $\mathrm{var}(f) > \mathrm{var}(e_i)$.

2.9 (Section 2.2.4)

Suppose that data are generated by a process that
satisfies Assumptions 1 and 3-6, but that the
random disturbances $\varepsilon_i$ do not have mean zero but
that $E[\varepsilon_i] = \mu_i$.

a. Show that the least squares slope estimator $b$
remains unbiased if $\mu_i = \mu$ is constant for all
$i = 1, \ldots, n$.

b. Now suppose that $\mu_i = x_i/10$ is proportional to
the level of $x_i$. Derive the bias $E[b] - \beta$ under
these assumptions.

c. Discuss whether Assumption 2 can be checked by
considering the least squares residuals $e_i$, $i = 1, \ldots, n$.
Consider in particular the situations of
a and b.


EMPIRICAL  AND  SIMULATION  QUESTIONS 


2.10 (Sections 2.1.2, 2.1.3, 2.2.3, 2.3.1)

Consider the set of $n = 12$ observations on
price $x_i$ and quantity sold $y_i$ for a brand of
coffee in Example 2.3. It may be instructive to perform
the calculations of this exercise only with the
help of a calculator. For this purpose we present the
data in the following table.

    i    Price   Quantity
    1    1.00       89
    2    1.00       86
    3    1.00       74
    4    1.00       79
    5    1.00       68
    6    1.00       84
    7    0.95      139
    8    0.95      122
    9    0.95      102
   10    0.85      186
   11    0.85      179
   12    0.85      187

a. Discuss whether the Assumptions 1-6 are plausible
for these data.

b. Compute the least squares estimates of $\alpha$ and $\beta$ in
the model $y_i = \alpha + \beta x_i + \varepsilon_i$.

c. Compute SST, SSE, SSR, and $R^2$ for these data.

d. Estimate the variance $\sigma^2$ of the disturbance
terms.

e. Compute the standard error of $b$ and test the null
hypothesis $H_0: \beta = 0$ against the alternative
$H_1: \beta \neq 0$, using a 5% significance level (the corresponding
two-sided critical value of the $t(10)$
distribution is $c = 2.23$).

f. Compute a 95% interval estimate of $\beta$.
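
After working through Exercise 2.10 by hand, the arithmetic can be cross-checked with a few lines of code. The sketch below is only one possible check, not part of the exercise: it uses plain Python, the variable names are illustrative, and it assumes the conventions of Chapter 2 in which SSR denotes the sum of squared residuals, SSE the explained sum of squares, and $s^2$ the residual sum of squares divided by $n - 2$.

```python
# Coffee data of Exercise 2.10: price x_i and quantity sold y_i, n = 12.
price = [1.00] * 6 + [0.95] * 3 + [0.85] * 3
quantity = [89, 86, 74, 79, 68, 84, 139, 122, 102, 186, 179, 187]

n = len(price)
x_bar = sum(price) / n
y_bar = sum(quantity) / n

s_xx = sum((x - x_bar) ** 2 for x in price)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(price, quantity))

b = s_xy / s_xx          # least squares slope
a = y_bar - b * x_bar    # least squares intercept

residuals = [y - a - b * x for x, y in zip(price, quantity)]
sst = sum((y - y_bar) ** 2 for y in quantity)   # total sum of squares
ssr = sum(e ** 2 for e in residuals)            # sum of squared residuals (assumed naming)
sse = sst - ssr                                 # explained sum of squares (assumed naming)
r2 = sse / sst

s2 = ssr / (n - 2)              # estimate of the disturbance variance
se_b = (s2 / s_xx) ** 0.5       # standard error of b
t_b = b / se_b                  # t-value for H0: beta = 0

c = 2.23                        # critical value quoted in part e
interval = (b - c * se_b, b + c * se_b)

print(f"a = {a:.2f}, b = {b:.2f}, R2 = {r2:.3f}")
print(f"s2 = {s2:.2f}, se(b) = {se_b:.2f}, t = {t_b:.2f}")
print(f"95% interval for beta: ({interval[0]:.1f}, {interval[1]:.1f})")
```

Comparing the printed values with the calculator results is a quick way to locate arithmetic slips in parts b to f.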

2.11 (Section 2.1.3)

Consider the excess returns data set described
in Example 2.1 on stock market
returns, with $x_i$ the excess returns of the
market index and $y_i$ the excess returns in the sector
of cyclical consumer goods.

a. Perform two regressions of $y$ on $x$, one in the
model $y_i = \alpha + \beta x_i + \varepsilon_i$ and the second in the
model $y_i = \beta x_i + \varepsilon_i$.

b. Check the conditions (2.11) and (2.12) for both
models.

c. Investigate the correlation between the two series
of residuals obtained in a. Can you explain this
outcome?

2.12 (Sections 2.1.2, 2.3.1)

Consider again the stock market returns
data of Example 2.1, with the x-variable
for the excess returns for the whole
market and with the y-variable for the excess returns
for the sector of cyclical consumer goods.

a. Use a software package to compute the
sample means $\bar{x}$ and $\bar{y}$ and the sample
moments $\sum (x_i - \bar{x})^2/n$, $\sum (y_i - \bar{y})^2/n$, and
$\sum (x_i - \bar{x})(y_i - \bar{y})/n$.

b. Compute $a$, $b$, $s^2$, and $R^2$ from the statistics in a.

c. Check the results by performing a regression of $y$
on $x$ by means of a software package.


2.13 (Section 2.3.1)

Consider the data generating process defined in
terms of Assumptions 1-7 with the following specifications.
In Assumption 1 take $n = 10$ and
$x_i = 100 + i$ for $i = 1, \ldots, 10$, in Assumption 3
take $\sigma^2 = 1$, and in Assumption 6 take $\alpha = -100$
and $\beta = 1$. Note that we happen to know the parameters
of this DGP, but we will simulate the situation
where the modeller knows only a set of data
generated by the DGP, and not the parameters of
the DGP.

a. Simulate one data set from this model and determine
the least squares estimates $a$ and $b$, the standard
errors of $a$ and $b$, and the $t$-values of $a$ and $b$.

b. Determine 95% interval estimates for $\alpha$ and $\beta$.

c. Repeat steps a and b 100 times. Compare the
resulting variances in the 100 estimates $a$ and $b$
with the theoretical variances. How many of the
100 computed interval estimates contain the true
values of $\alpha$ and $\beta$?

d. Now combine the data into one large data set
with 1000 observations. Estimate $\alpha$ and $\beta$ by
using all 1000 observations simultaneously and
construct 95% interval estimates for $\alpha$ and $\beta$.
Discuss the resulting outcomes.
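
For readers without a regression package at hand, step a of Exercise 2.13 can be set up as in the following sketch. It is a minimal illustration under stated assumptions rather than a prescribed solution: it uses only the Python standard library, the function names `simulate` and `ols` are hypothetical, the seed is arbitrary, and the standard error formulas are the usual ones for simple regression.

```python
import random

random.seed(0)  # arbitrary seed, chosen only to make the run reproducible

# DGP of Exercise 2.13: x_i = 100 + i, alpha = -100, beta = 1, sigma^2 = 1.
n, alpha, beta, sigma = 10, -100.0, 1.0, 1.0
x = [100 + i for i in range(1, n + 1)]


def simulate():
    """Draw one data set y_1, ..., y_n from the DGP with normal disturbances."""
    return [alpha + beta * xi + random.gauss(0.0, sigma) for xi in x]


def ols(x, y):
    """Return a, b and their standard errors for a simple regression of y on x."""
    m = len(x)
    x_bar, y_bar = sum(x) / m, sum(y) / m
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    a = y_bar - b * x_bar
    s2 = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y)) / (m - 2)
    se_b = (s2 / s_xx) ** 0.5
    se_a = (s2 * (1 / m + x_bar ** 2 / s_xx)) ** 0.5
    return a, b, se_a, se_b


a, b, se_a, se_b = ols(x, simulate())
print(f"a = {a:.2f} (se {se_a:.2f}, t = {a / se_a:.2f})")
print(f"b = {b:.3f} (se {se_b:.3f}, t = {b / se_b:.2f})")
```

Calling `ols(x, simulate())` inside a loop gives the 100 replications asked for in step c.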


2.14 (Sections 2.3.1, 2.4.1)

Consider the data set of Exercise 1.11 on
student learning, with FGPA and SATM
scores of ten students. We investigate how
far the FGPA scores of these students can be explained
in terms of their SATM scores.

a. Regress the FGPA scores on a constant and
SATM and compute $a$, $b$, and $s^2$.

b. Perform 5% significance tests on $a$ and $b$.

c. Construct 95% interval estimates for $\alpha$ and $\beta$.

d.  Make  a  point  prediction  of  the  FGPA  score  for  a 
student  with  SATM  score  equal  to  6.0.  Construct 
also  a  95%  prediction  interval. 

e.  Discuss  the  conditions  needed  to  be  confident 
about  these  predictions. 

2.15 (Section 2.4.1)

Consider the CAPM of Example 2.5 for
the stock market returns data on the
excess returns $y_i$ for the sector of cyclical
consumer goods and $x_i$ for the market index. This
data set consists of 240 monthly returns. We pay
special attention to the ‘crash observation’ $i = 94$
corresponding to October 1987 when a crash took
place.

a. Estimate the CAPM using all the available data,
and a second version where the crash observation
is deleted from the sample.

b. Compare the outcomes of the two regressions.

c. Use the second model (estimated without the
crash observation) to predict the value of $y_{94}$
for the given historical value of $x_{94}$. Construct
also four prediction intervals, with confidence
levels 50%, 90%, 95%, and 99%. Does the
actual value of $y_{94}$ belong to these intervals?

d. Explain the relation between your findings in b
and c.

e. Answer questions a and b also for some other
sectors instead of cyclical consumer goods,
that is, for the three sectors ‘Noncyclical Consumer
Goods’, ‘Information Technology’, and
‘Telecommunication, Media and Technology’.

f. For each of the four sectors in a and e, test the
null hypothesis $H_0: \beta = 1$ against the alternative
$H_1: \beta \neq 1$, using the data over the full sample
(that is, including the crash observation). For
which sectors should this hypothesis be rejected
(at the 5% significance level)?

g. Relate the outcomes in f to the risk of the different
sectors as compared to the total market in the
UK over the period 1980-99.


3 


Multiple  Regression 


In  practice  there  often  exists  more  than  one  variable  that  influences  the 
dependent variable. This chapter discusses the regression model with multiple
explanatory variables. We use matrices to describe and analyse this
model. We present the method of least squares, its statistical properties,
and the idea of partial regression. The F-test is the central tool for testing
linear hypotheses, with a test for predictive accuracy as a special case.
Particular attention is paid to the question whether additional variables
should be included in the model or not.


3.1 Least squares in matrix form

Uses Appendix A.2-A.4, A.6, A.7.


3.1.1  Introduction 

More  than  one  explanatory  variable 

In  the  foregoing  chapter  we  considered  the  simple  regression  model  where 
the  dependent  variable  is  related  to  one  explanatory  variable.  In  practice  the 
situation  is  often  more  involved  in  the  sense  that  there  exists  more  than  one 
variable  that  influences  the  dependent  variable. 

As  an  illustration  we  consider  again  the  salaries  of  474  employees  at  a 
US bank (see Example 2.2 (p. 77) on bank wages). In Chapter 2 the variations
in salaries (measured in logarithms) were explained by variations in
education  of  the  employees.  As  can  be  observed  from  the  scatter  diagram  in 
Exhibit  2.5(a)  (p.  85)  and  the  regression  results  in  Exhibit  2.6  (p.  86),  around 
half  of  the  variability  (as  measured  by  the  variance)  can  be  explained  in  this 
way.  Of  course,  the  salary  of  an  employee  is  not  only  determined  by  the 
number  of  years  of  education  because  many  other  variables  also  play  a  role. 
Apart  from  salary  and  education,  the  following  data  are  available  for  each 
employee:  begin  or  starting  salary  (the  salary  that  the  individual  earned  at  his 
or  her  first  position  at  this  bank),  gender  (with  value  zero  for  females  and  one 
for  males),  ethnic  minority  (with  value  zero  for  non-minorities  and  value  one 
for  minorities),  and  job  category  (category  1  consists  of  administrative  jobs, 
category  2  of  custodial  jobs,  and  category  3  of  management  jobs).  The  begin 
salary  can  be  seen  as  an  indication  of  the  qualities  of  the  employee  that, 
apart from education, are determined by previous experience, personal characteristics,
and so on. The other variables may also affect the earned salary.

Simple  regression  may  be  misleading 

Of course, the effect of each variable could be estimated by a simple regression
of salaries on each explanatory variable separately. For the explanatory
variables education, begin salary, and gender, the scatter diagrams with
regression lines are shown in Exhibit 3.1 (a-c). However, these results may
be misleading, as the explanatory variables are mutually related. For
example, the gender effect on salaries (c) is partly caused by the gender effect
on education (e). Similar relations between the explanatory variables are
shown in (d) and (f). This mutual dependence is taken into account by
formulating a multiple regression model that contains more than one explanatory
variable.


Exhibit 3.1 Scatter diagrams of Bank Wage data

[Six panels: (a) LOGSAL vs. EDUC, (b) LOGSAL vs. LOGSALBEGIN, (c) LOGSAL vs. GENDER,
(d) LOGSALBEGIN vs. EDUC, (e) EDUC vs. GENDER, (f) LOGSALBEGIN vs. GENDER.]

Scatter diagrams with regression lines for several bivariate relations between the variables
LOGSAL (logarithm of yearly salary in dollars), EDUC (finished years of education),
LOGSALBEGIN (logarithm of yearly salary when employee entered the firm) and GENDER
(0 for females, 1 for males), for 474 employees of a US bank.
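
The warning that simple regressions may mislead when explanatory variables are mutually related can be illustrated with simulated data. The sketch below does not use the bank wage data; the variables `x2` and `x3`, their coefficients, and the sample size are hypothetical choices made only to show the effect. Because `x3` is correlated with `x2`, the simple regression slope of `x2` also picks up part of the effect of `x3`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical data: x3 is correlated with x2, and both affect y.
x2 = rng.normal(size=n)
x3 = 0.8 * x2 + rng.normal(scale=0.6, size=n)
y = 1.0 + 0.5 * x2 + 1.5 * x3 + rng.normal(scale=0.3, size=n)


def ols(X, y):
    """Least squares coefficients for y regressed on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


const = np.ones(n)
simple = ols(np.column_stack([const, x2]), y)         # y on constant and x2
multiple = ols(np.column_stack([const, x2, x3]), y)   # y on constant, x2 and x3

print("slope of x2, simple regression:  ", simple[1])
print("slope of x2, multiple regression:", multiple[1])
```

In this design the simple regression slope lies well above the value 0.5 used to generate the data, whereas the multiple regression slope is close to it.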


3.1.2  Least  squares 

Uses Appendix A.7.


Regression  model  in  matrix  form 

The  linear  model  with  several  explanatory  variables  is  given  by  the  equation 

$$y_i = \beta_1 + \beta_2 x_{2i} + \beta_3 x_{3i} + \cdots + \beta_k x_{ki} + \varepsilon_i \qquad (i = 1, \ldots, n). \tag{3.1}$$


From now on we follow the convention that the constant term is denoted by
$\beta_1$ rather than $\alpha$. The first explanatory variable $x_1$ is defined by $x_{1i} = 1$ for
every $i = 1, \ldots, n$, and for simplicity of notation we write $\beta_1$ instead of $\beta_1 x_{1i}$.
For purposes of analysis it is convenient to express the model (3.1) in matrix
form. Let

$$y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \quad
X = \begin{pmatrix} 1 & x_{21} & \cdots & x_{k1} \\ \vdots & \vdots & & \vdots \\ 1 & x_{2n} & \cdots & x_{kn} \end{pmatrix}, \quad
\beta = \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_k \end{pmatrix}, \quad
\varepsilon = \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}. \tag{3.2}$$

Note that in the $n \times k$ matrix $X = (x_{ji})$ the first index $j$ ($j = 1, \ldots, k$) refers to
the variable number (in columns) and the second index $i$ ($i = 1, \ldots, n$) refers
to the observation number (in rows). The notation in (3.2) is common in
econometrics (whereas in books on linear algebra the indices $i$ and $j$ are often
reversed). In our notation, we can rewrite (3.1) as

$$y = X\beta + \varepsilon. \tag{3.3}$$

Here $\beta$ is a $k \times 1$ vector of unknown parameters and $\varepsilon$ is an $n \times 1$ vector of
unobserved disturbances.


Residuals  and  the  least  squares  criterion 

If $b$ is a $k \times 1$ vector of estimates of $\beta$, then the estimated model may be
written as

$$y = Xb + e. \tag{3.4}$$

Here $e$ denotes the $n \times 1$ vector of residuals, which can be computed from the
data and the vector of estimates $b$ by means of

$$e = y - Xb. \tag{3.5}$$

We denote transposition of matrices by primes ($'$); for instance, the transpose
of the residual vector $e$ is the $1 \times n$ matrix $e' = (e_1, \ldots, e_n)$. To determine
the least squares estimator, we write the sum of squares of the residuals
(a function of $b$) as

$$S(b) = \sum e_i^2 = e'e = (y - Xb)'(y - Xb) = y'y - y'Xb - b'X'y + b'X'Xb. \tag{3.6}$$


Derivation  of  least  squares  estimator 

The minimum of $S(b)$ is obtained by setting the derivatives of $S(b)$ equal to zero.
Note that the function $S(b)$ has scalar values, whereas $b$ is a column vector with $k$
components. So we have $k$ first order derivatives and we will follow the convention
to arrange them in a column vector. The second and third terms of the last expression
in (3.6) are equal (a $1 \times 1$ matrix is always symmetric) and may be replaced by
$-2b'X'y$. This is a linear expression in the elements of $b$ and so the vector of
derivatives equals $-2X'y$. The last term of (3.6) is a quadratic form in the elements
of $b$. The vector of first order derivatives of this term $b'X'Xb$ can be written as
$2X'Xb$. The proof of this result is left as an exercise (see Exercise 3.1). To get the idea
we consider the case $k = 2$ and we denote the elements of $X'X$ by $c_{ij}$, $i, j = 1, 2$,
with $c_{12} = c_{21}$. Then $b'X'Xb = c_{11}b_1^2 + c_{22}b_2^2 + 2c_{12}b_1b_2$. The derivative with respect
to $b_1$ is $2c_{11}b_1 + 2c_{12}b_2$ and the derivative with respect to $b_2$ is
$2c_{12}b_1 + 2c_{22}b_2$. When we arrange these two partial derivatives in a $2 \times 1$ vector,
this can be written as $2X'Xb$. See Appendix A (especially Examples A.10 and A.11
in Section A.7) for further computational details and illustrations.
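
The derivative formula obtained in this derivation, $-2X'y + 2X'Xb$, can be checked numerically against finite differences of $S(b)$. The following sketch is an illustration only: the data are randomly generated, the dimensions and the step size `h` are arbitrary, and the trial vector `b` is deliberately not the least squares solution.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = rng.normal(size=n)
b = rng.normal(size=k)  # an arbitrary trial vector, not the minimizer


def S(b):
    """Sum of squared residuals (y - Xb)'(y - Xb)."""
    e = y - X @ b
    return e @ e


# Vector of first order derivatives from the matrix formula.
analytic = -2 * X.T @ y + 2 * X.T @ X @ b

# Central finite differences, one coordinate of b at a time.
h = 1e-6
unit = np.eye(k)
numeric = np.array([
    (S(b + h * unit[j]) - S(b - h * unit[j])) / (2 * h)
    for j in range(k)
])

print("largest discrepancy:", np.max(np.abs(analytic - numeric)))
```

Since $S(b)$ is quadratic in $b$, the central difference agrees with the analytic derivative up to rounding error.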


The  least  squares  estimator 

Combining  the  above  results,  we  obtain 

$$\frac{\partial S}{\partial b} = -2X'y + 2X'Xb. \tag{3.7}$$

The least squares estimator is obtained by minimizing $S(b)$. Therefore we set
these derivatives equal to zero, which gives the normal equations

$$X'Xb = X'y. \tag{3.8}$$

Solving this for $b$, we obtain

$$b = (X'X)^{-1}X'y \tag{3.9}$$

provided that the inverse of $X'X$ exists, which means that the matrix $X$
should have rank $k$. As $X$ is an $n \times k$ matrix, this requires in particular that
$n \ge k$, that is, the number of parameters is smaller than or equal to the
number of observations. In practice we will almost always require that $k$ is
considerably smaller than $n$.
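
As a hedged illustration of (3.8) and (3.9), the sketch below solves the normal equations for simulated data and compares the result with NumPy's least squares routine. The parameter values, sample size, and the rank-deficient matrix `X_bad` are hypothetical and serve only to show what the rank condition means in practice.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta = np.array([1.0, 2.0, -0.5])        # hypothetical parameter values
y = X @ beta + rng.normal(scale=0.1, size=n)

# (3.9) requires rank(X) = k, so that X'X is invertible.
rank_X = np.linalg.matrix_rank(X)

# Solve the normal equations X'Xb = X'y (3.8) directly instead of
# forming the inverse explicitly.
b = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least squares routine.
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Appending an exact multiple of an existing column adds a regressor
# without raising the rank, so the rank condition fails.
X_bad = np.column_stack([X, 2 * X[:, 1]])
rank_bad = np.linalg.matrix_rank(X_bad)

print("b          :", b)
print("rank(X)    :", rank_X, "with k =", k)
print("rank(X_bad):", rank_bad, "with k =", X_bad.shape[1])
```

Solving the linear system rather than inverting $X'X$ is a common numerical choice; both routes give the estimator in (3.9) when $X$ has rank $k$.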


Proof  of  minimum 

From  now  on,  if  we  write  b,  we  always  mean  the  expression  in  (3.9).  This  is  the 
classical  formula  for  the  least  squares  estimator  in  matrix  notation.  If  the  matrix  X 
has rank $k$, it follows that the Hessian matrix

$$\frac{\partial^2 S}{\partial b \, \partial b'} = 2X'X \tag{3.10}$$

is a positive definite matrix (see Exercise 3.2). This implies that (3.9) is indeed
the minimum of (3.6). In (3.10) we take the derivatives of a vector ($\partial S / \partial b$) with
respect to another vector ($b'$) and we follow the convention to arrange these
derivatives in a matrix (see Exercise 3.2). An alternative proof that $b$ minimizes
the sum of squares (3.6) that makes no use of first and second order derivatives is
given in Exercise 3.3.
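
The claim that $2X'X$ is positive definite when $X$ has rank $k$ can be checked numerically through its eigenvalues. The sketch below also evaluates $S$ at a point away from $b$; because $X'e = 0$ at the least squares solution, the increase in $S$ equals $d'X'Xd$ for any displacement $d$. The data and the displacement are randomly generated for illustration only.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 30, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = rng.normal(size=n)


def S(b):
    """Sum of squared residuals (y - Xb)'(y - Xb)."""
    e = y - X @ b
    return e @ e


b = np.linalg.solve(X.T @ X, X.T @ y)   # least squares estimate (3.9)
hessian = 2 * X.T @ X                   # Hessian matrix (3.10)

# A symmetric matrix is positive definite if all its eigenvalues are > 0.
eigenvalues = np.linalg.eigvalsh(hessian)
print("smallest eigenvalue:", eigenvalues.min())

# Moving away from b in any direction d increases S by d'X'Xd.
d = rng.normal(size=k)
increase = S(b + d) - S(b)
quadratic_form = d @ X.T @ X @ d
print("S(b + d) - S(b):", increase)
print("d'X'Xd         :", quadratic_form)
```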


Summary  of  computations 

The  least  squares  estimates  can  be  computed  as  follows. 


Least  squares  estimation 

•  Step 1: Choice of variables. Choose the variable to be explained ($y$) and the
explanatory variables ($x_1, \ldots, x_k$, where $x_1$ is often the constant that
always takes the value 1).

•  Step 2: Collect data. Collect $n$ observations of $y$ and of the related values of
$x_1, \ldots, x_k$ and store the data of $y$ in an $n \times 1$ vector and the data on the
explanatory variables in the $n \times k$ matrix $X$.

•  Step  3:  Compute  the  estimates.  Compute  the  least  squares  estimates  by  the 
OLS  formula  (3.9)  by  using  a  regression  package. 


Exercises: T: 3.1, 3.2.


3.1.3 Geometric interpretation

Uses Sections 1.2.2, 1.2.3; Appendix A.6.


Least  squares  seen  as  projection 

The least squares method can be given a geometric interpretation, which we
discuss now. Using the expression (3.9) for $b$, the residuals may be written as

$$e = y - Xb = y - X(X'X)^{-1}X'y = My \tag{3.11}$$

where

$$M = I - X(X'X)^{-1}X'. \tag{3.12}$$

The matrix $M$ is symmetric ($M' = M$) and idempotent ($M^2 = M$). Since it
also has the property $MX = 0$, it follows from (3.11) that

$$X'e = 0. \tag{3.13}$$

We may write the explained component $\hat{y}$ of $y$ as

$$\hat{y} = Xb = Hy \tag{3.14}$$

where

$$H = X(X'X)^{-1}X' \tag{3.15}$$

is called the ‘hat matrix’, since it transforms $y$ into $\hat{y}$ (pronounced: ‘y-hat’).
Clearly, there holds $H' = H$, $H^2 = H$, $H + M = I$ and $HM = 0$. So

$$y = Hy + My = \hat{y} + e$$

where, because of (3.11) and (3.13), $\hat{y}'e = 0$, so that the vectors $\hat{y}$ and $e$
are orthogonal to each other. Therefore, the least squares method can be
given the following interpretation. The sum of squares $e'e$ is the square of
the length of the residual vector $e = y - Xb$. The length of this vector is
minimized by choosing $Xb$ as the orthogonal projection of $y$ onto the space
spanned by the columns of $X$. This is illustrated in Exhibit 3.2. The projection
is characterized by the property that $e = y - Xb$ is orthogonal to
all columns of $X$, so that $0 = X'e = X'(y - Xb)$. This gives the normal
equations (3.8).

Exhibit 3.2 Least squares

Three-dimensional geometric impression of least squares: the vector of observations on
the dependent variable $y$ is projected onto the plane of the independent variables $X$ to obtain
the linear combination $Xb$ of the independent variables that is as close as possible to $y$.

Geometry of least squares

Let $S(X)$ be the space spanned by the columns of $X$ (that is, the set of all $n \times 1$
vectors that can be written as $Xa$ for some $k \times 1$ vector $a$) and let $S^{\perp}(X)$ be the space
orthogonal to $S(X)$ (that is, the set of all $n \times 1$ vectors $z$ with the property that
$X'z = 0$). The matrix $H$ projects onto $S(X)$ and the matrix $M$ projects onto $S^{\perp}(X)$.
In $y = \hat{y} + e$, the vector $y$ is decomposed into two orthogonal components, with
$\hat{y} \in S(X)$ according to (3.14) and $e \in S^{\perp}(X)$ according to (3.13). The essence of this
decomposition is given in Exhibit 3.3, which can be seen as a two-dimensional
version of the three-dimensional picture in Exhibit 3.2.
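
The algebraic properties of $H$ and $M$ listed in this section are easy to confirm numerically. The sketch below builds both matrices for a small randomly generated $X$ (an arbitrary illustration with $n = 8$ and $k = 3$) and checks each property up to floating point tolerance.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 8, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix (3.15), projects onto S(X)
M = np.eye(n) - H                      # matrix M of (3.12), projects onto the orthogonal space

y_hat = H @ y   # explained component (3.14)
e = M @ y       # residuals (3.11)

checks = {
    "M symmetric": np.allclose(M, M.T),
    "M idempotent": np.allclose(M @ M, M),
    "H symmetric": np.allclose(H, H.T),
    "H idempotent": np.allclose(H @ H, H),
    "MX = 0": np.allclose(M @ X, 0),
    "HM = 0": np.allclose(H @ M, 0),
    "X'e = 0": np.allclose(X.T @ e, 0),
    "y_hat'e = 0": np.isclose(y_hat @ e, 0),
    "y = y_hat + e": np.allclose(y, y_hat + e),
}
for name, ok in checks.items():
    print(f"{name}: {ok}")
```

Forming the $n \times n$ matrices explicitly is convenient for small examples like this one; for large $n$ one would normally compute $\hat{y}$ and $e$ from $b$ instead.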


Exhibit 3.3 Least squares

Two-dimensional geometric impression of least squares where the $k$-dimensional plane $S(X)$ is
represented by the horizontal line: the vector of observations on the dependent variable $y$ is
projected onto the space of the independent variables $S(X)$ to obtain the linear combination $Xb$
of the independent variables that is as close as possible to $y$.


Geometric interpretation as a tool in analysis

This geometric interpretation can be helpful to understand some of the algebraic
properties of least squares. As an example we consider the effect of applying linear
transformations on the set of explanatory variables. Suppose that the $n \times k$ matrix
$X$ is replaced by $X^* = XA$ where $A$ is a $k \times k$ invertible matrix. Then the least
squares fit ($\hat{y}$), the residuals ($e$), and the projection matrices ($H$ and $M$) remain
unaffected by this transformation. This is immediately evident from the geometric
pictures in Exhibits 3.2 and 3.3, as $S(X^*) = S(X)$.

The properties can also be checked algebraically, by working out the expressions
for $\hat{y}$, $e$, $H$, and $M$ in terms of $X^*$. The least squares estimates change after
the transformation, as $b^* = (X^{*\prime}X^*)^{-1}X^{*\prime}y = A^{-1}b$. For example, suppose that
the variable $x_k$ is measured in dollars and $x_k^*$ is the same variable measured in
thousands of dollars. Then $x_{ki}^* = x_{ki}/1000$ for $i = 1, \ldots, n$, and $X^* = XA$ where $A$
is the diagonal matrix $\mathrm{diag}(1, \ldots, 1, 0.001)$. The least squares estimates of $\beta_j$ for
$j \neq k$ remain unaffected, that is, $b_j^* = b_j$ for $j \neq k$, and $b_k^* = 1000\,b_k$. This also
makes perfect sense, as one unit increase in $x_k^*$ corresponds to an increase of a
thousand units in $x_k$.
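
The rescaling example can be reproduced in a few lines. In the sketch below the last regressor is a hypothetical amount in dollars, and `A` converts it to thousands of dollars as in the text; the data, coefficients, and the helper `fit` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 40
X = np.column_stack([
    np.ones(n),
    rng.normal(size=n),
    rng.normal(loc=2000.0, scale=500.0, size=n),   # hypothetical amounts in dollars
])
y = X @ np.array([2.0, 1.0, 0.002]) + rng.normal(size=n)

A = np.diag([1.0, 1.0, 0.001])   # last regressor: dollars to thousands of dollars
X_star = X @ A


def fit(X, y):
    """Return the least squares estimates and the fitted values."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b, X @ b


b, y_hat = fit(X, y)
b_star, y_hat_star = fit(X_star, y)

print("b     :", b)
print("b_star:", b_star)
print("fitted values unchanged:", np.allclose(y_hat, y_hat_star))
```

The first two estimates coincide across the two fits, while the last one is multiplied by 1000, in line with $b^* = A^{-1}b$.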

Exercises:  T:  3.3. 


3.1.4  Statistical  properties 

Uses  Sections  1.2.2,  1.3.2. 


Seven  assumptions  on  the  multiple  regression  model 

To analyse the statistical properties of least squares estimation, it is convenient
to use as conceptual background again the simulation experiment described
in Section 2.2.1 (p. 87-8). We first restate the seven assumptions of
Section 2.2.3 (p. 92) for the multiple regression model (3.3) and use the
matrix notation introduced in Section 3.1.2.

•  Assumption 1: fixed regressors. All elements of the $n \times k$ matrix $X$ containing
the observations on the explanatory variables are non-stochastic. It
is assumed that $n \ge k$ and that the matrix $X$ has rank $k$.

•  Assumption 2: random disturbances, zero mean. The $n \times 1$ vector $\varepsilon$ consists
of random disturbances with zero mean so that $E[\varepsilon] = 0$, that is,
$E[\varepsilon_i] = 0$ $(i = 1, \ldots, n)$.

•  Assumption 3: homoskedasticity. The covariance matrix of the disturbances
$E[\varepsilon\varepsilon']$ exists and all its diagonal elements are equal to $\sigma^2$, that is,
$E[\varepsilon_i^2] = \sigma^2$ $(i = 1, \ldots, n)$.

•  Assumption 4: no correlation. The off-diagonal elements of the covariance
matrix of the disturbances $E[\varepsilon\varepsilon']$ are all equal to zero, that is, $E[\varepsilon_i\varepsilon_j] = 0$ for
all $i \neq j$.

•  Assumption 5: constant parameters. The elements of the $k \times 1$ vector $\beta$
and the scalar $\sigma$ are fixed unknown numbers with $\sigma > 0$.

•  Assumption 6: linear model. The data on the explained variable $y$ have
been generated by the data generating process (DGP)

$$y = X\beta + \varepsilon. \tag{3.16}$$

•  Assumption 7: normality. The disturbances are jointly normally distributed.

Assumptions 3 and 4 can be summarized in matrix notation as

$$E[\varepsilon\varepsilon'] = \sigma^2 I, \tag{3.17}$$

where $I$ denotes the $n \times n$ identity matrix. If in addition Assumption 7 is
satisfied, then $\varepsilon$ follows the multivariate normal distribution

$$\varepsilon \sim N(0, \sigma^2 I).$$

Assumptions 4 and 7 imply that the disturbances $\varepsilon_i$, $i = 1, \ldots, n$ are mutually
independent.

Least  squares  is  unbiased 

The expected value of $b$ is obtained by using Assumptions 1, 2, 5, and 6.
Assumption 6 implies that the least squares estimator $b = (X'X)^{-1}X'y$ can be
written as

$$b = (X'X)^{-1}X'(X\beta + \varepsilon) = \beta + (X'X)^{-1}X'\varepsilon.$$

Taking expectations is a linear operation, that is, if $z_1$ and $z_2$ are two
random variables and $A_1$ and $A_2$ are two non-random matrices of
appropriate dimensions so that $z = A_1 z_1 + A_2 z_2$ is well defined, then
$E[z] = A_1 E[z_1] + A_2 E[z_2]$. From Assumptions 1, 2, and 5 we obtain

$$E[b] = E[\beta + (X'X)^{-1}X'\varepsilon] = \beta + (X'X)^{-1}X'E[\varepsilon] = \beta. \tag{3.18}$$

So $b$ is unbiased.

The  covariance  matrix  of  b 

Using the result (3.18), we obtain that under Assumptions 1-6 the covariance
matrix of $b$ is given by

$$\begin{aligned}
\mathrm{var}(b) &= E[(b - \beta)(b - \beta)'] = E[(X'X)^{-1}X'\varepsilon\varepsilon'X(X'X)^{-1}] \\
&= (X'X)^{-1}X'E[\varepsilon\varepsilon']X(X'X)^{-1} = (X'X)^{-1}X'(\sigma^2 I)X(X'X)^{-1} \\
&= \sigma^2 (X'X)^{-1}.
\end{aligned} \tag{3.19}$$

The diagonal elements of this matrix are the variances of the estimators of
the individual parameters, and the off-diagonal elements are the covariances
between these estimators.
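
Results (3.18) and (3.19) can be made concrete with a small simulation in the spirit of the experiment of Section 2.2.1: the regressors are held fixed, new disturbances are drawn many times, and the mean and covariance of the resulting estimates are compared with $\beta$ and $\sigma^2(X'X)^{-1}$. The sketch below is illustrative only; the values of `beta`, `sigma`, `n`, and the number of replications are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(6)
n, k, sigma = 25, 3, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])  # fixed regressors
beta = np.array([1.0, -2.0, 0.5])                               # hypothetical values

XtX_inv = np.linalg.inv(X.T @ X)
reps = 20_000

# Each row of eps is one draw of the n disturbances; X is the same in every draw.
eps = rng.normal(scale=sigma, size=(reps, n))
Y = X @ beta + eps        # shape (reps, n): one simulated sample per row
B = Y @ X @ XtX_inv       # row r holds b' = y'X(X'X)^{-1} for sample r

mean_b = B.mean(axis=0)
cov_b = np.cov(B, rowvar=False)
theory = sigma**2 * XtX_inv

print("mean of b over replications:", mean_b)
print("empirical covariance matrix:\n", cov_b)
print("sigma^2 (X'X)^{-1}:\n", theory)
```

With this many replications the simulated mean and covariance matrix should be close to their theoretical counterparts, apart from simulation noise.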
Least squares is best linear unbiased

The Gauss-Markov theorem, proved in Section 2.2.5 (p. 97-8) for the simple
regression model, also holds for the more general model (3.16). It states that,
among all linear unbiased estimators, $b$ has minimal variance, that is, $b$ is
the best linear unbiased estimator (BLUE) in the sense that, if $\hat{\beta} = Ay$ with $A$
a $k \times n$ non-stochastic matrix and $E[\hat{\beta}] = \beta$, then $\mathrm{var}(\hat{\beta}) - \mathrm{var}(b)$ is a positive
semidefinite matrix. This means that for every $k \times 1$ vector $c$ of constants
there holds $c'(\mathrm{var}(\hat{\beta}) - \mathrm{var}(b))c \ge 0$, or, equivalently, $\mathrm{var}(c'b) \le \mathrm{var}(c'\hat{\beta})$.
Choosing for $c$ the $j$th unit vector, this means in particular that for the $j$th
component $\mathrm{var}(b_j) \le \mathrm{var}(\hat{\beta}_j)$ so that the least squares estimators are efficient.
This result holds true under Assumptions 1-6; the assumption of normality is
not needed.


Proof  of  Gauss-Markov  theorem 

To prove the result, first note that the condition that $E[\hat\beta] = E[Ay] =
AE[y] = AX\beta = \beta$ for all $\beta$ implies that AX = I, the k x k identity matrix. Now
define $D = A - (X'X)^{-1}X'$; then $DX = AX - (X'X)^{-1}X'X = I - I = 0$, so that

$$ \mathrm{var}(\hat\beta) = \mathrm{var}(Ay) = \mathrm{var}(A\varepsilon) = \sigma^2 AA' = \sigma^2 DD' + \sigma^2 (X'X)^{-1}, $$

where the last equality follows by writing $A = D + (X'X)^{-1}X'$ and working out
AA'. This shows that $\mathrm{var}(\hat\beta) - \mathrm{var}(b) = \sigma^2 DD'$, which is positive semidefinite, and
zero if and only if D = 0, that is, $A = (X'X)^{-1}X'$. So $\hat\beta = b$ gives the minimal
variance.
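
The steps of the proof can be traced numerically. The sketch below is an illustration under assumed inputs (simulated X and an arbitrary matrix D with DX = 0, built here by projecting a random matrix with M); it is not taken from the text. It constructs a competing linear unbiased estimator Ay and checks that AA' exceeds $(X'X)^{-1}$ by the positive semidefinite matrix DD'.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 30, 3                            # illustrative sizes (assumed)
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
XtX_inv = np.linalg.inv(X.T @ X)
A_ols = XtX_inv @ X.T                   # least squares: b = A_ols y
M = np.eye(n) - X @ A_ols               # M X = 0

# Any D with D X = 0 gives another linear unbiased estimator A y.
D = rng.normal(size=(k, n)) @ M         # hypothetical choice; D X = 0 by construction
A = A_ols + D

unbiased = np.allclose(A @ X, np.eye(k))            # A X = I
gap = A @ A.T - XtX_inv                             # (var(beta_hat) - var(b)) / sigma^2
gap_is_DDt = np.allclose(gap, D @ D.T)
min_eig = np.linalg.eigvalsh((gap + gap.T) / 2).min()   # nonnegative up to rounding
```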


Exercises:  T:  3.4. 


3.1.5  Estimating  the  disturbance  variance 


Derivation  of  unbiased  estimator 

Next we consider the estimation of the unknown variance $\sigma^2$. As in the previous
chapter we make use of the sum of squared residuals e'e. Intuition could suggest
estimating $\sigma^2 = E[\varepsilon_i^2]$ by the sample average $\frac{1}{n}\sum e_i^2 = \frac{1}{n} e'e$, but this estimator is not
unbiased. It follows from (3.11) and (3.16) and the fact that MX = 0 that
$e = My = M(X\beta + \varepsilon) = M\varepsilon$. So

$$ E[e] = 0, \qquad (3.20) $$

$$ \mathrm{var}(e) = E[ee'] = E[M\varepsilon\varepsilon'M] = ME[\varepsilon\varepsilon']M = \sigma^2 M^2 = \sigma^2 M. \qquad (3.21) $$

To evaluate E[e'e] it is convenient to use the trace of a square matrix, which is
defined as the sum of the diagonal elements of this matrix. Because the trace and
the expectation operator can be interchanged, we find, using the property that
tr(AB) = tr(BA), that

$$ E[e'e] = E[\mathrm{tr}(ee')] = \mathrm{tr}(E[ee']) = \sigma^2 \mathrm{tr}(M). $$

Using the property that tr(A + B) = tr(A) + tr(B) we can simplify this as

$$ \mathrm{tr}(M) = \mathrm{tr}(I_n - X(X'X)^{-1}X') = n - \mathrm{tr}(X(X'X)^{-1}X') $$
$$ = n - \mathrm{tr}(X'X(X'X)^{-1}) = n - \mathrm{tr}(I_k) = n - k, $$

where the subscripts denote the order of the identity matrices.


The least squares estimator $s^2$ and standard errors

This shows that $E[e'e] = (n - k)\sigma^2$ so that

$$ s^2 = \frac{e'e}{n - k} \qquad (3.22) $$

is an unbiased estimator of $\sigma^2$. The square root s of (3.22) is called the
standard error of the regression. If in the expression (3.19) we replace $\sigma^2$
by $s^2$ and if we denote the jth diagonal element of $(X'X)^{-1}$ by $a_{jj}$, then $s\sqrt{a_{jj}}$ is
called the standard error of the estimated coefficient $b_j$. This is an estimate of
the standard deviation $\sigma\sqrt{a_{jj}}$ of $b_j$.
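
The quantities in this subsection are straightforward to compute. The following sketch uses simulated data (the sizes, coefficients, and `sigma` are assumptions, not values from the text) to form the residuals e = My, confirm that tr(M) = n - k, and compute $s^2$, s, and the standard errors $s\sqrt{a_{jj}}$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, sigma = 40, 3, 0.5                # illustrative values (assumed)
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=sigma, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y                   # least squares estimates
M = np.eye(n) - X @ XtX_inv @ X.T
e = M @ y                               # residuals e = My

trace_M = np.trace(M)                   # equals n - k
s2 = e @ e / (n - k)                    # unbiased estimator (3.22)
s = np.sqrt(s2)                         # standard error of the regression
se_b = s * np.sqrt(np.diag(XtX_inv))    # standard errors s * sqrt(a_jj)
```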


Intuition  for  the  factor  1/(n  -  k) 

The result in (3.22) can also be given a more intuitive interpretation. Suppose
we tried to explain y by a matrix X with k = n columns and rank k.
Then we would obtain e = 0, a perfect fit, but we would not have obtained
any information on $\sigma^2$. Of course this is an extreme case. In practice we
confine ourselves to the case k < n. Because b is chosen to minimize the sum
of squared residuals, the squared residuals are smaller (on average) than the
squared disturbances. Let us consider a diagonal element of (3.21),

$$ \mathrm{var}(e_i) = \sigma^2 (1 - h_i), \qquad (3.23) $$

where $h_i$ is the ith diagonal element of the matrix H = I - M in (3.15). As H
is positive semidefinite, it follows that $h_i \ge 0$. If the model contains a constant
term (so that the matrix X contains a column of ones), then $h_i > 0$ (see
Exercise 3.7). So each single element $e_i$ of the residual vector has a variance
that is smaller than $\sigma^2$, and therefore the sum of squares $\sum e_i^2$ has an expected
value less than $n\sigma^2$. This effect becomes stronger when we have more
parameters to obtain a good fit for the data. If one would like to use a
small residual variance as a criterion for a good model, then the denominator
(n - k) of the estimator (3.22) gives an automatic penalty for choosing
models with large k.
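
The role of the diagonal elements $h_i$ can be made concrete. The sketch below assumes a simulated regressor matrix with a constant term (an illustrative choice). It computes the $h_i$ and shows that the relative residual variances $1 - h_i$ in (3.23) add up to n - k, which is the source of the divisor in (3.22).

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 25, 4                            # illustrative sizes (assumed)
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])

H = X @ np.linalg.inv(X.T @ X) @ X.T    # H = I - M
h = np.diag(H)                          # diagonal elements h_i

# var(e_i) / sigma^2 = 1 - h_i, cf. (3.23)
rel_var = 1.0 - h
total = rel_var.sum()                   # equals n - k, so E[e'e] = (n - k) sigma^2
```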

Intuition  for  the  number  of  degrees  of  freedom  (n  -  k) 

As $e = M\varepsilon$, it follows under Assumptions 1-7 that $e'e/\sigma^2 = \varepsilon'M\varepsilon/\sigma^2$ follows
the $\chi^2$-distribution with (n - k) degrees of freedom. This follows from the
results in Section 1.2.3 (p. 32), using the fact that M is an idempotent matrix
with rank (n - k). The term degrees of freedom refers to the restrictions
X'e = 0. We may partition this as $X_1'e_1 + X_2'e_2 = 0$, where $X_1'$ is a k x (n - k)
matrix and $X_2'$ a k x k matrix. If the matrix $X_2'$ has a rank less than k, we may
rearrange the columns of X' in such a way that $X_2'$ has rank k. The restrictions
imply that, once we have freely chosen the n - k elements of $e_1$, the
remaining elements are dictated by $e_2 = -(X_2')^{-1}X_1'e_1$. This is also clear from
Exhibit 3.3. For given matrix X of explanatory variables, the residual vector
lies in $S^{\perp}(X)$ and this space has dimension (n - k). That is, k degrees of
freedom are lost because $\beta$ has been estimated.
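
The statement that k residuals are dictated by the other n - k can be checked directly. In this sketch the data are simulated and the partition simply takes the last k observations as the block $X_2$; both are assumptions made for illustration. The last k residuals are recovered from the first n - k through $e_2 = -(X_2')^{-1}X_1'e_1$.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 12, 3                            # illustrative sizes (assumed)
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b                           # satisfies X'e = 0

# Partition the observations: the first n - k residuals are "free" ...
X1, X2 = X[: n - k], X[n - k :]         # X2 is k x k (assumed to have rank k)
e1, e2 = e[: n - k], e[n - k :]

# ... and the last k follow from X1'e1 + X2'e2 = 0.
e2_implied = -np.linalg.solve(X2.T, X1.T @ e1)
```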

Exercises:  T:  3.5,  3.7a. 


3.1.6  Coefficient  of  determination 


Derivation of $R^2$

The performance of least squares can be evaluated by the coefficient of
determination $R^2$, that is, the fraction of the total sample variation $\sum (y_i - \bar{y})^2$ that is
explained by the model.

In matrix notation, the total sample variation can be written as y'Ny with

$$ N = I - \frac{1}{n}\iota\iota', $$

where $\iota = (1, \ldots, 1)'$ is the n x 1 vector of ones. The matrix N has the property
that it takes deviations from the mean, as the elements of Ny are $y_i - \bar{y}$. Note that
N is a special case of an M-matrix (3.12) with $X = \iota$, as $\iota'\iota = n$. So Ny can be
interpreted as the vector of residuals and y'Ny = (Ny)'Ny as the residual sum of
squares from a regression where y is explained by $X = \iota$. If X in the multiple
regression model (3.3) contains a constant term, then the fact that X'e = 0
implies that $\iota'e = 0$ and hence Ne = e. From y = Xb + e we then obtain
Ny = NXb + Ne = NXb + e = ‘explained’ + ‘residual’, and

$$ y'Ny = (Ny)'Ny = (NXb + e)'(NXb + e) = b'X'NXb + e'e. $$

Here the cross term vanishes because b'X'Ne = 0, as Ne = e and X'e = 0. It
follows that the total variation in y (SST) can be decomposed in an explained
part SSE = b'X'NXb and a residual part SSR = e'e.

Coefficient of determination: $R^2$

Therefore $R^2$ is given by

$$ R^2 = \frac{SSE}{SST} = \frac{b'X'NXb}{y'Ny} = 1 - \frac{e'e}{y'Ny} = 1 - \frac{SSR}{SST}. \qquad (3.24) $$

The third equality in (3.24) holds true if the model contains a constant term.
If this is not the case, then SSR may be larger than SST (see Exercise 3.7) and
$R^2$ is defined as SSE/SST (and not as 1 - SSR/SST). If the model contains a
constant term, then (3.24) shows that $0 \le R^2 \le 1$. It is left as an exercise (see
Exercise 3.7) to show that $R^2$ is the squared sample correlation coefficient
between y and its explained part $\hat{y} = Xb$. In geometric terms, R (the square
root of $R^2$) is equal to the length of NXb divided by the length of Ny; that
is, R is equal to the cosine of the angle between Ny and NXb. This is
illustrated in Exhibit 3.4. A good fit is obtained when Ny is close to
NXb, that is, when the angle between these two vectors is small. This
corresponds to a high value of $R^2$.

Adjusted $R^2$

When explanatory variables are added to the model, then $R^2$ never decreases
(see Exercise 3.6). The wish to penalize models with large k has motivated an
adjusted $R^2$ defined by adjusting for the degrees of freedom:

$$ \bar{R}^2 = 1 - \frac{e'e/(n - k)}{y'Ny/(n - 1)} = 1 - \frac{n - 1}{n - k}(1 - R^2). \qquad (3.25) $$

Exhibit 3.4 Geometric picture of $R^2$

Two-dimensional geometric impression of the coefficient of determination. The dependent
variable and all the independent variables are taken in deviation from their sample means, with
resulting vector of dependent variables Ny and matrix of independent variables NX. The
explained part of Ny is NXb with residuals Ne = e, and the coefficient of determination is
equal to the square of the cosine of the indicated angle $\varphi$.
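
The decomposition SST = SSE + SSR and the definitions (3.24) and (3.25) can be verified on any data set with a constant term. The following sketch uses simulated data (an assumption for illustration) and also checks that $R^2$ equals the squared sample correlation between y and Xb.

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 60, 3                            # illustrative sizes (assumed)
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])   # includes a constant
y = X @ np.array([2.0, 1.0, -0.5]) + rng.normal(size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
N = np.eye(n) - np.ones((n, n)) / n     # N = I - (1/n) iota iota'

SST = y @ N @ y                         # total variation
SSE = b @ X.T @ N @ X @ b               # explained part
SSR = e @ e                             # residual part

R2 = SSE / SST                          # also equals 1 - SSR/SST here
R2_corr = np.corrcoef(y, X @ b)[0, 1] ** 2
R2_adj = 1 - (n - 1) / (n - k) * (1 - R2)
```

If the column of ones is dropped from X, the equality of SSE/SST and 1 - SSR/SST is no longer guaranteed, in line with the remark following (3.24).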


Exercises: T: 3.6a, b, 3.7b, c.


3.1.7  Illustration:  Bank  Wages 

To illustrate the foregoing results we consider the data on salary and education
discussed earlier in Chapter 2 and in Section 3.1.1. We will discuss (i) the
data, (ii) the model, (iii) the normal equations and the least squares estimates,
(iv) the interpretation of the estimates, (v) the sums of squares and $R^2$, and
(vi) the orthogonality of residuals and explanatory variables.

(i)  Data 

The data consist of a cross section of 474 individuals working for a US bank.
For each employee, the information consists of the following variables:
salary (S), education ($x_2$), begin salary (B), gender ($x_4 = 0$ for females,
$x_4 = 1$ for males), minority ($x_5 = 1$ if the individual belongs to a minority
group, $x_5 = 0$ otherwise), job category ($x_6 = 1$ for clerical jobs, $x_6 = 2$ for
custodial jobs, and $x_6 = 3$ for management positions), and some further
job-related variables.

(ii)  Model 

As a start, we will consider the model with y = log(S) as variable to be
explained and with $x_2$ and $x_3 = \log(B)$ as explanatory variables. That is,
we consider the regression model

$$ y_i = \beta_1 + \beta_2 x_{2i} + \beta_3 x_{3i} + \varepsilon_i \quad (i = 1, \ldots, n). $$

(iii)  Normal  equations  and  least  squares  estimates 

As before, to simplify the notation we define the first regressor by $x_{1i} = 1$. The
normal equations (3.8) involve the cross product terms X'X and X'y. For the
data at hand they are given (after rounding) in Exhibit 3.5, Panel 1. Solving
the normal equations (3.8) gives the least squares estimates
shown in Panel 3 in Exhibit 3.5, so that (after rounding) $b_1 = 1.647$,
$b_2 = 0.023$, and $b_3 = 0.869$. It may be checked from the cross products in
Panel 1 in Exhibit 3.5 that X'Xb = X'y (apart from rounding errors), that is,

$$ \begin{pmatrix} 474 & 6395 & 4583 \\ 6395 & 90215 & 62166 \\ 4583 & 62166 & 44377 \end{pmatrix} \begin{pmatrix} 1.647 \\ 0.023 \\ 0.869 \end{pmatrix} = \begin{pmatrix} 4909 \\ 66609 \\ 47527 \end{pmatrix}. $$
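
The check of the normal equations can be reproduced with the reported numbers. The sketch below takes the rounded cross products from Panel 1 and the coefficient estimates from Panel 3 of Exhibit 3.5 and multiplies them out; because the cross products are rounded to integers, the two sides agree only approximately.

```python
import numpy as np

# Cross products X'X and X'y from Panel 1 of Exhibit 3.5 (rounded to integers).
XtX = np.array([[474, 6395, 4583],
                [6395, 90215, 62166],
                [4583, 62166, 44377]], dtype=float)
Xty = np.array([4909, 66609, 47527], dtype=float)

# Least squares estimates from Panel 3 of Exhibit 3.5.
b = np.array([1.646916, 0.023122, 0.868505])

lhs = XtX @ b                           # should reproduce X'y up to rounding
max_gap = np.abs(lhs - Xty).max()       # largest discrepancy over the three equations
```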

Panel 1

               IOTA   LOGSAL    EDUC   LOGSALBEGIN
IOTA            474
LOGSAL         4909    50917
EDUC           6395    66609   90215
LOGSALBEGIN    4583    47527   62166         44377

Panel 2

                 LOGSAL       EDUC   LOGSALBEGIN
LOGSAL         1.000000
EDUC           0.696740   1.000000
LOGSALBEGIN    0.886368   0.685719      1.000000

Panel 3: Dependent Variable: LOGSAL
Method: Least Squares
Sample: 1 474
Included observations: 474

Variable                   Coefficient   Std. Error
C                             1.646916     0.274598
EDUC                          0.023122     0.003894
LOGSALBEGIN                   0.868505     0.031835

R-squared                     0.800579
Adjusted R-squared            0.799733
S.E. of regression            0.177812
Sum squared resid             14.89166
Total sum of squares          74.67462
Explained sum of squares      59.78296

Panel 4: Dependent Variable: RESID
Method: Least Squares
Sample: 1 474
Included observations: 474

Variable                   Coefficient
C                             3.10E-11
EDUC                          2.47E-13
LOGSALBEGIN                  -3.55E-12

R-squared                     0.000000
Adjusted R-squared           -0.004246
S.E. of regression            0.177812
Sum squared resid             14.89166

Exhibit 3.5 Bank Wages (Section 3.1.7)

Panel 1 contains the cross product terms (X'X and X'y) of the variables (iota denotes the
constant term with all values equal to one), Panel 2 shows the correlations between the
dependent and the two independent variables, and Panel 3 shows the outcomes obtained by
regressing salary (in logarithms) on a constant and the explanatory variables education and the
logarithm of begin salary. The residuals of this regression are denoted by RESID, and Panel 4
shows the result of regressing these residuals on a constant and the two explanatory variables
(3.10E-11 means $3.10 \times 10^{-11}$, and so on; these values are zero up to numerical rounding).


(iv)  Interpretation  of  estimates 

A first thing to note here is that the marginal relative effect of education on
wage (that is, $\frac{dS/S}{dx_2} = \frac{dy}{dx_2} = \beta_2$) is estimated now as 0.023, whereas in
Chapter 2 this effect was estimated as 0.096 with a standard error of 0.005
(see Exhibit 2.11, p. 103). This is a substantial difference. That is, an
additional year of education corresponds on average with a 9.6 per cent
increase in salary. But, if the begin salary is ‘kept fixed’, an additional year of
education gives only a 2.3 per cent increase in salary. The cause of this
difference is that the variable ‘begin salary’ is strongly related to the variable
‘education’. This is clear from Panel 2 in Exhibit 3.5, which shows that $x_2$
and $x_3$ have a correlation of around 69 per cent. We refer also to Exhibit 3.1
(d), which shows a strong positive relation between $x_2$ and $x_3$. This means
that in Chapter 2, where we have excluded the begin salary from the model,
part of the positive association between education and salary is due to a third
variable, begin salary. This explains why the estimated effect in Chapter 2 is
larger.

(v) Sums of squares and $R^2$

The sums of squares for this model are reported in Panel 3 in Exhibit 3.5,
with values SST = 74.675, SSE = 59.783, and SSR = 14.892, so that
$R^2 = 0.801$. This is larger than the $R^2 = 0.485$ in Chapter 2 (see Exhibit
2.6, p. 86). In Section 3.4 we will discuss a method to test whether this is a
significant increase in the model fit. Panel 3 in Exhibit 3.5 also reports the
standard error of the regression $s = \sqrt{SSR/(474 - 3)} = 0.178$ and the
standard error of $b_2$, which is 0.0039.
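
The summary statistics in Panel 3 of Exhibit 3.5 follow from the reported sums of squares by simple arithmetic. The sketch below recomputes $R^2$ from (3.24), s from (3.22), and the adjusted $R^2$ from (3.25), using only the values printed in the exhibit.

```python
import math

n, k = 474, 3
SST, SSE, SSR = 74.67462, 59.78296, 14.89166   # Panel 3 of Exhibit 3.5

R2 = SSE / SST                          # coefficient of determination (3.24)
R2_alt = 1 - SSR / SST                  # same value, as the model has a constant
s = math.sqrt(SSR / (n - k))            # standard error of the regression
R2_adj = 1 - (n - 1) / (n - k) * (1 - R2)   # adjusted R-squared (3.25)
```

Up to rounding, these values match the entries R-squared, S.E. of regression, and Adjusted R-squared in Panel 3.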

(vi)  Orthogonality  of  residuals  and  explanatory  variables 

Panel 4 in Exhibit 3.5 shows the result of regressing the least squares
residuals on the variables $x_1$, $x_2$, and $x_3$. This gives an $R^2 = 0$, which is in
accordance with the property that the residuals are uncorrelated with the
explanatory variables in the sense that X'e = 0 (see Exhibits 3.2 and 3.4).

3.2 Adding or deleting variables

Uses  Appendix  A.2-A.4. 


Choice  of  the  number  of  explanatory  variables 

To  make  an  econometric  model  we  have  to  decide  which  variables  provide 
the  best  explanation  of  the  dependent  variable.  That  is,  we  have  to  decide 
which  explanatory  variables  should  be  included  in  the  model.  In  this 
section  we  analyse  what  happens  if  we  add  variables  to  our  model  or  delete 
variables from our model. This is illustrated in the scheme in Exhibit 3.6,
where $X_1$ and $X_2$ denote two subsets of variables. Here $X_1$ is included in
the model, and the question is whether $X_2$ should be included in the model
or not.

Organization  of  this  section 

The section is organized as follows. Section 3.2.1 considers the effects of
including or deleting variables on the regression coefficients, and Section
3.2.2 provides an interpretation of this result in terms of ceteris paribus
conditions. In Sections 3.2.3 and 3.2.4 we analyse the statistical
consequences of omitting or including variables. Section 3.2.5 shows that, in a
multiple regression model, each individual coefficient measures the effect
of an explanatory variable on the dependent variable after neutralizing
for the effects that are due to the other explanatory variables included in
the model.


Two subsets of explanatory variables ($X_1$ and $X_2$) influence the variable to be explained (y),
and one subset of explanatory variables ($X_1$) influences the other one ($X_2$). The (total) effect of
$X_1$ on y (a) is denoted by b in Chapter 2 (and by $b_R$ in Section 3.2.1), the (partial) effect of $X_1$
on y (b) for given value of $X_2$ is denoted by $b_1$ and the (partial) effect of $X_2$ on y for given value
of $X_1$ is denoted by $b_2$. The effect of changes in $X_1$ on $X_2$ is denoted by P.




3.2.1  Restricted  and  unrestricted  models 

Two  models:  Notation 

As before, we consider the regression model $y = X\beta + \varepsilon$ where X is the
n x k matrix of explanatory variables with rank(X) = k. We partition
the explanatory variables in two groups, one with k - g variables that
are certainly included in the model and another with the remaining g variables
that may be included or deleted. The matrix of explanatory variables
is partitioned as $X = (X_1 \; X_2)$, where $X_1$ is the n x (k - g) matrix of
observations of the included regressors and $X_2$ is the n x g matrix with
observations on the variables that may be included or deleted. The k x 1
vector $\beta$ of unknown parameters is decomposed in a similar way in the
(k - g) x 1 vector $\beta_1$ and the g x 1 vector $\beta_2$. Then the regression model
can be written as

$$ y = X_1\beta_1 + X_2\beta_2 + \varepsilon. \qquad (3.26) $$

All the assumptions on the linear model introduced in Section 3.1.4 are
assumed to hold true. In this section we compare two versions of the
model, namely, the unrestricted version in (3.26) and a restricted version
where $X_2$ is deleted from the model. In particular, we investigate the
consequences of deleting $X_2$ for the estimate of $\beta_1$ and for the residuals of the
estimated model.

Least  squares  in  the  restricted  model 

In the restricted model we estimate $\beta_1$ by regressing y on $X_1$, so that

$$ b_R = (X_1'X_1)^{-1}X_1'y. \qquad (3.27) $$

We use the notation

$$ e_R = y - X_1 b_R \qquad (3.28) $$

for  the  corresponding  restricted  residuals. 

Least  squares  in  the  unrestricted  model 

We may write the unrestricted model as

$$ y = X\beta + \varepsilon = (X_1 \;\; X_2) \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} + \varepsilon. \qquad (3.29) $$

The unrestricted least squares estimator is given by $b = (X'X)^{-1}X'y$.
Decomposing the k x 1 vector b into a (k - g) x 1 vector $b_1$ (the unrestricted
estimator of $\beta_1$) and a g x 1 vector $b_2$ (the unrestricted estimator of $\beta_2$), we
can write the unrestricted regression as

$$ y = (X_1 \;\; X_2) \begin{pmatrix} b_1 \\ b_2 \end{pmatrix} + e = X_1 b_1 + X_2 b_2 + e. \qquad (3.30) $$

So we continue to write e for the residuals of the unrestricted model. We have
learned in the previous section that the least squares residuals are orthogonal
to all regressors. So we now have $X_1'e_R = 0$ for the restricted model, and
$X_1'e = 0$ and $X_2'e = 0$ for the unrestricted model. Note, however, that in
general $X_2'e_R \neq 0$.

Comparison of $b_R$ and $b_1$

To study the difference between the two estimators $b_R$ and $b_1$ of $\beta_1$, we
premultiply (3.30) by the matrix $(X_1'X_1)^{-1}X_1'$ and make use of $X_1'e = 0$ to
obtain $b_R = (X_1'X_1)^{-1}X_1'y = b_1 + (X_1'X_1)^{-1}X_1'X_2 b_2$, that is,

$$ b_R = b_1 + P b_2 \qquad (3.31) $$

where the (k - g) x g matrix P is defined by

$$ P = (X_1'X_1)^{-1}X_1'X_2. \qquad (3.32) $$

So we see that the difference $b_R - b_1$ depends on both P and $b_2$. If either of
these terms vanishes, then $b_R = b_1$. This is the case, for instance, if $X_2$ has no
effect at all ($b_2 = 0$) or if $X_1$ and $X_2$ are orthogonal ($X_1'X_2 = 0$). In these
cases it does not matter for the estimate of $\beta_1$ whether we include $X_2$ in the
model or not. However, in general the restricted estimate $b_R$ will be different
from the unrestricted estimate $b_1$.


Comparison of $e_R'e_R$ and $e'e$

Next we compare the residuals of both equations, that is, the residuals $e_R$
in the restricted regression (3.28) and the residuals e in the unrestricted
regression (3.30). As the unrestricted model contains more variables to
explain the dependent variable, it can be expected that it provides a better
(or at least not a worse) fit than the restricted model so that $e'e \le e_R'e_R$. This
is indeed the case, as we will now show.

Derivation of sums of squares

To prove that $e'e \le e_R'e_R$ we start with the restricted residuals and then substitute
the unrestricted model (3.30) for y. We obtain

$$ e_R = M_1 y = M_1(X_1 b_1 + X_2 b_2 + e) = M_1 X_2 b_2 + e. \qquad (3.33) $$

Here $M_1 = I - X_1(X_1'X_1)^{-1}X_1'$ is the projection orthogonal to the space spanned
by the columns of $X_1$, and we used that $M_1 X_1 = 0$ and $M_1 e = e$ (as $X_1'e = 0$). So
the difference between $e_R$ and e depends on $M_1 X_2$ and $b_2$. We see that $e_R = e$ if,
for instance, $X_2$ has no effect at all ($b_2 = 0$). For the sums of squared residuals,
(3.33) implies that

$$ e_R'e_R = b_2'X_2'M_1X_2b_2 + e'e \qquad (3.34) $$

where the product term vanishes as $X_2'M_1e = X_2'e - X_2'X_1(X_1'X_1)^{-1}X_1'e = 0$
because $X_1'e = 0$ and $X_2'e = 0$.

Interpretation  of  result 

As $M_1$ is a positive semidefinite matrix, it follows that $b_2'X_2'M_1X_2b_2 =
(X_2b_2)'M_1(X_2b_2) \ge 0$ in (3.34), so that

$$ e_R'e_R \ge e'e $$

and the inequality is strict unless $M_1X_2b_2 = 0$. This shows that adding
variables to a regression model in general leads to a reduction of the sum
of squared residuals. If this reduction is substantial, then this motivates
including the variables $X_2$ in the model, as they provide a significant additional
explanation of the dependent variable. A test for the significance of the
increased model fit is derived in Section 3.4.1.
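
The relations (3.31) and (3.34) are algebraic identities, so they hold exactly (up to rounding) in any data set. The sketch below checks both on simulated data; the sample size, the coefficients, and the way $X_2$ is made to correlate with $X_1$ are assumptions chosen for illustration, and `ols` is a small helper defined here, not a library function.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 80                                  # illustrative sample size (assumed)
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])           # k - g = 2 regressors
X2 = (0.7 * X1[:, 1] + rng.normal(size=n)).reshape(n, 1)         # g = 1, correlated with X1
y = X1 @ np.array([1.0, 0.5]) + 2.0 * X2[:, 0] + rng.normal(size=n)

def ols(X, y):
    """Return least squares coefficients and residuals."""
    b = np.linalg.solve(X.T @ X, X.T @ y)
    return b, y - X @ b

bR, eR = ols(X1, y)                     # restricted model (3.27)-(3.28)
b, e = ols(np.hstack([X1, X2]), y)      # unrestricted model (3.30)
b1, b2 = b[:2], b[2:]

P = np.linalg.solve(X1.T @ X1, X1.T @ X2)                # (3.32)
M1 = np.eye(n) - X1 @ np.linalg.solve(X1.T @ X1, X1.T)   # projection orthogonal to X1

gap_coef = bR - (b1 + P @ b2)           # zero by (3.31)
extra = b2 @ X2.T @ M1 @ X2 @ b2        # first term on the right of (3.34)
```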


Example  3.1:  Bank  Wages  (continued) 

To  illustrate  the  results  in  this  section  we  return  to  the  illustration  in  Section 
3.1.7.  The  dependent  variable  y  is  the  yearly  wage  (in  logarithms).  In  the 
restricted  model  we  take  as  explanatory  variables  ‘education’  and  a  constant 
term,  and  in  the  notation  of  Section  3.2.1  these  two  variables  are  collected  in 
the  matrix  X\  with  n  rows  and  k  —  g  =  2  columns.  In  the  unrestricted  model 
we  take  as  explanatory  variables  ‘education’,  a  constant  term,  and  the 
additional  variable  ‘begin  salary’  (in  logarithms).  This  additional  variable  is 
denoted  by  the  matrix  X2  with  n  rows  and  g  =  1  column  in  this  case. 

Panel 1: Dependent Variable: LOGSAL
Method: Least Squares

Variable       Coefficient   Std. Error
C                 9.062102     0.062738
EDUC              0.095963     0.004548
R-squared         0.485447

Panel 2: Dependent Variable: LOGSAL
Method: Least Squares

Variable       Coefficient   Std. Error
C                 1.646916     0.274598
EDUC              0.023122     0.003894
LOGSALBEGIN       0.868505     0.031835
R-squared         0.800579

Panel 3: Dependent Variable: RESIDUNREST
Method: Least Squares

Variable       Coefficient
C                 3.10E-11
EDUC              2.47E-13
LOGSALBEGIN      -3.55E-12
R-squared         0.000000

Panel 4: Dependent Variable: RESIDREST
Method: Least Squares

Variable       Coefficient
C                 3.78E-13
EDUC             -2.76E-14
R-squared         0.000000

Panel 5: Dependent Variable: RESIDREST
Method: Least Squares

Variable       Coefficient
C                -4.449130
LOGSALBEGIN       0.460124
R-squared         0.324464

Panel 6: Dependent Variable: LOGSALBEGIN
Method: Least Squares

Variable       Coefficient
C                 8.537878
EDUC              0.083869
R-squared         0.470211

Exhibit 3.7 Bank Wages (Example 3.1)

Regression in the restricted model (Panel 1) and in the unrestricted model (Panel 2). The
residuals of the unrestricted regression (denoted by RESIDUNREST) are uncorrelated with
both explanatory variables (Panel 3), but the residuals of the restricted regression (denoted by
RESIDREST) are uncorrelated only with education (Panel 4) and not with the logarithm of
begin salary (Panel 5). The regression in Panel 6 shows that the logarithm of begin salary is
related to education.

The results of the restricted and unrestricted regressions are given in
Panels 1 and 2 of Exhibit 3.7. The unrestricted model (in Panel 2) has a
larger $R^2$ than the restricted model (in Panel 1). As $R^2 = 1 - (e'e/SST)
= 0.801 > 0.485 = 1 - (e_R'e_R)/SST$, it follows that $e'e < e_R'e_R$. Panels 3-5 in
Exhibit 3.7 show that $X'e = 0$ (Panel 3), $X_1'e_R = 0$ (Panel 4), but $X_2'e_R \neq 0$
(Panel 5). Panel 6 shows the regression of $X_2$ on $X_1$ corresponding to (3.32).
It follows from the outcomes in Panel 1 (for $b_R$), Panel 2 (for $b_1$ and $b_2$), and
Panel 6 (for P) that (apart from rounding errors)

$$ \begin{pmatrix} 9.062 \\ 0.096 \end{pmatrix} = \begin{pmatrix} 1.647 \\ 0.023 \end{pmatrix} + \begin{pmatrix} 8.538 \\ 0.084 \end{pmatrix} 0.869, $$

which verifies the relation $b_R = b_1 + Pb_2$ in (3.31) between restricted and
unrestricted least squares estimates.
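
The same verification can be done with the unrounded coefficients printed in Exhibit 3.7. The sketch below copies the estimates from Panels 1, 2, and 6 and evaluates $b_1 + Pb_2$ element by element; any remaining gap is due to the six-decimal rounding of the printed output.

```python
# Estimates copied from Exhibit 3.7 (rows C and EDUC in each panel).
bR = (9.062102, 0.095963)               # Panel 1: restricted model
b1 = (1.646916, 0.023122)               # Panel 2: unrestricted model, C and EDUC
b2 = 0.868505                           # Panel 2: coefficient of LOGSALBEGIN
P = (8.537878, 0.083869)                # Panel 6: LOGSALBEGIN regressed on C and EDUC

# Relation (3.31): bR = b1 + P * b2, element by element.
implied = tuple(b1_j + p_j * b2 for b1_j, p_j in zip(b1, P))
gaps = tuple(abs(r - m) for r, m in zip(bR, implied))
```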


Summary  of  computations 

In the restricted model (where y is regressed on k - g regressors) the
(k - g) x 1 vector of least squares estimates is given by $b_R = (X_1'X_1)^{-1}X_1'y$.
In the unrestricted model (where y is regressed on the same k - g regressors
and g additional regressors) the k x 1 vector of least squares estimates is
given by $b = (X'X)^{-1}X'y$.

Let b be decomposed in two parts as $b = (b_1', b_2')'$, where the (k - g) x 1
vector $b_1$ corresponds to the regressors of the restricted model and $b_2$ to the g
added regressors. Then the relation between $b_R$ and $b_1$ is given by
$b_R = b_1 + Pb_2$.


3.2.2  Interpretation  of  regression  coefficients 

Relations between regressors: The effect of $X_1$ on $X_2$

The result in (3.31) shows that the estimated effect of $X_1$ on y changes from
$b_1$ to $b_R = b_1 + Pb_2$ if we delete the regressors $X_2$ from the model. The
question arises which of these two estimates should be preferred. To investigate
this question, we first give an interpretation of the matrix P in (3.32).
This matrix may be interpreted in terms of regressions, where each column of
$X_2$ is regressed on $X_1$. For the jth column of $X_2$, say z, this gives estimated
coefficients $p_j = (X_1'X_1)^{-1}X_1'z$ with explained part $\hat{z} = X_1 p_j$ and residual
vector $z - \hat{z} = M_1 z$ where $M_1 = I - X_1(X_1'X_1)^{-1}X_1'$. Collecting the g
regressions $z = \hat{z} + M_1 z$ in g columns, we get

$$ X_2 = X_1 P + M_1 X_2 = \text{‘explained part’} + \text{‘residuals’} \qquad (3.35) $$

with $P = (X_1'X_1)^{-1}X_1'X_2$ as defined in (3.32).

Non-experimental data and the ceteris paribus idea

The auxiliary regressions (3.35) have an interesting interpretation. In
experimental situations where we are free to choose the matrices $X_1$ and $X_2$, we can
choose orthogonal columns so that P = 0. The result in (3.31) shows that
neglecting the variables $X_2$ then has no effect on the estimate of $\beta_1$. On the
other hand, if $X_1$ and $X_2$ are uncontrolled, then there are several possible
reasons why P could be different from 0. For instance, $X_1$ may ‘cause’ $X_2$ or
$X_2$ may ‘cause’ $X_1$, or there could exist a third ‘cause’ in the background that
influences both $X_1$ and $X_2$. It may be useful to keep this in mind when
interpreting the restricted estimate $b_R$ and the unrestricted estimate b.
Consider the second element of $b_R$ (the first element is the intercept). Traditionally,
in a linear relationship this measures the partial derivative $\partial y/\partial z$ (where z now
denotes the second explanatory variable, that is, the second column of $X_1$).
It answers the question how y will react to a change in z ceteris paribus, that
is, if all other things remain equal. Now the question is: which ‘other things’?
In the restricted model, the ‘other things’ clearly are the remaining columns of
the matrix $X_1$ and the residual $e_R$, and in the unrestricted model the ‘other
things’ are the same columns of $X_1$ and in addition the columns of $X_2$ and the
residual e.

Direct,  indirect,  and  total  effects 

So  the  restricted  and  the  unrestricted  model  raise  different  questions  and  one 
should  not  be  surprised  if  different  questions  lead  to  different  answers.  Take 
the  particular  case  that  X\  ‘causes’  X2.  Then  a  change  of  X\  may  have  two 
effects  on  y,  a  direct  effect  measured  by  b\  and  an  indirect  effect  measured  by 
Pbi.  It  is  seen  from  (3.31)  that  these  are  precisely  the  two  components  of  b r. 
Under  these  circumstances  it  may  be  hard  to  keep  X2  constant  if  Xi  changes. 
So  in  this  case  it  may  be  more  natural  to  look  at  the  restricted  model.  That  is, 
bR  gives  a  better  idea  of  the  total  effect  on  y  of  changes  in  X\  than  b\,  as  it  is 
unnatural  to  assume  that  X2  remains  fixed. 

If the variables satisfied exact functional relationships, say
$y = f(x_1, x_2)$ and $x_2 = h(x_1)$ (with $k = 2$ and $g = 1$), then the marginal effect
of $x_1$ on $y$ would be given by

$$\frac{dy}{dx_1} = \frac{\partial f}{\partial x_1} + \frac{dh}{dx_1}\,\frac{\partial f}{\partial x_2}.$$

Here the total effect of $x_1$ on $y$ (on the left-hand side) is decomposed as the
sum of two terms (on the right-hand side), the direct effect of $x_1$ on $y$ (the first
term) and the indirect effect that runs via $x_2$ (the second term). The result in
(3.31) shows that the same relation holds true when linear relationships are
estimated by least squares.
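
As a minimal worked illustration, suppose both relationships are exactly linear. The linear forms and the symbols $a$, $c$, and $p$ are assumptions chosen for this sketch; they mirror the notation $b_R = b + pc$ used in the bank wages example.

```latex
\begin{align*}
y &= a\,x_1 + c\,x_2, \qquad x_2 = p\,x_1, \\
\frac{dy}{dx_1} &= \frac{\partial f}{\partial x_1}
                 + \frac{dh}{dx_1}\,\frac{\partial f}{\partial x_2}
                 = a + p\,c .
\end{align*}
```

Here $a$ is the direct effect and $pc$ is the indirect effect that runs via $x_2$, so the total effect has the same two-part form as the decomposition of $b_R$.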


Interpretation  of  regression  coefficients  in  restricted 
and  unrestricted  model 

If  one  wants  to  estimate  only  the  direct  effect  of  an  explanatory  variable  — 
that  is,  under  the  assumption  that  all  other  explanatory  variables  remain 
fixed  —  then  one  should  estimate  the  unrestricted  model  that  includes  all 
explanatory  variables.  On  the  other  hand,  if  one  wants  to  estimate  the  total 
effect  of  an  explanatory  variable  —  that  is,  the  direct  effect  and  all  the 
indirect  effects  that  run  via  the  other  explanatory  variables  —  then  one 
should estimate the restricted model where all the other explanatory
variables are deleted.


Example  3.2:  Bank  Wages  (continued) 

To illustrate the relation between direct, indirect, and total effects, we return
to Example 3.1 on bank wages. The current salary of an employee is influenced
by the education and the begin salary of that employee. Clearly, the
begin salary may to a large extent be determined by education. The results
discussed in Example 3.1 are summarized in Exhibit 3.8 and have the
following interpretation. In the restricted model (without begin salary)
the coefficient $b_R = 0.0960$ measures the total effect of education on salary.
This effect is split into two parts in the unrestricted model as $b_R = b + pc$.


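[Path diagram with two panels. Panel (a): x2 = education → y = log current
salary, with coefficient 0.0960. Panel (b): x2 = education → y = log current
salary, with coefficient 0.0231; x2 = education → x3 = log begin salary, with
coefficient 0.0839; x3 = log begin salary → y = log current salary, with
coefficient 0.8685.]

Exhibit 3.8 Bank Wages (Example 3.2)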

Two  variables  (education  and  begin  salary)  influence  the  current  salary,  and  education  also 
influences  the  begin  salary.  The  total  effect  of  education  on  salary  consists  of  two  parts,  a 
direct  effect  and  an  indirect  effect  that  runs  via  the  begin  salary.  If  salary  is  regressed  on 
education  alone,  the  estimated  effect  is  0.0960,  and  if  salary  is  regressed  on  education 
and  begin  salary  together,  then  the  estimated  effects  are  respectively  0.0231  and  0.8685. 
If  begin  salary  is  regressed  on  education,  the  estimated  effect  is  0.0839.  In  this  case  the 
direct  effect  is  0.0231,  the  indirect  effect  is  0.0839  •  0.8685  =  0.0729,  and  the  total  effect 
is  0.0231  +  0.0729  =  0.0960. 


XM301BWA
Here $b = 0.0231$ measures the direct effect of education on salary under the
assumption that begin salary remains constant (the ceteris paribus condition),
and $pc = 0.0839 \cdot 0.8685 = 0.0729$ measures the indirect effect of
education on salary that is due to a higher begin salary.

Clearly, the estimates $b_R$ and $b$ have very different interpretations.
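
The decomposition of the total effect into a direct and an indirect part can be checked numerically. The sketch below uses simulated data (the coefficients 0.8, 0.5, and 2.0 and the helper `ols` are hypothetical choices, not the bank wage data) and assumes NumPy is available. It also repeats the arithmetic with the rounded estimates reported in Exhibit 3.8.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulated data (hypothetical): x2 drives x3, and both drive y.
x2 = rng.normal(size=n)
x3 = 0.8 * x2 + rng.normal(size=n)
y = 1.0 + 0.5 * x2 + 2.0 * x3 + rng.normal(size=n)
const = np.ones(n)

def ols(X, y):
    """Least squares coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Restricted model: y on a constant and x2 (total effect of x2).
b_R = ols(np.column_stack([const, x2]), y)[1]
# Unrestricted model: y on a constant, x2 and x3 (direct effect b, effect c of x3).
_, b, c = ols(np.column_stack([const, x2, x3]), y)
# Auxiliary regression: x3 on a constant and x2 (effect p of x2 on x3).
p = ols(np.column_stack([const, x2]), x3)[1]

print(b_R, b + p * c)  # the two numbers agree up to rounding error

# Same arithmetic with the rounded estimates of Exhibit 3.8.
total_exhibit = 0.0231 + 0.0839 * 0.8685
print(round(total_exhibit, 4))
```

The identity holds exactly in every sample, not only on average, because it is an algebraic property of least squares.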

Exercises:  E:  3.14a-c. 


3.2.3  Omitting  variables 

Choice  of  explanatory  variables 

For most economic variables to be explained, one can find a long list of
possible  explanatory  variables.  The  question  is  which  of  these  variables 
should  be  included  in  the  model.  It  seems  intuitively  reasonable  to  include 
variables  only  if  they  have  a  clear  effect  on  the  dependent  variable  and  to 
omit  variables  that  are  less  important.  In  this  section  we  analyse  the  effect  of 
omitting  variables  from  the  model,  and  in  the  next  section  of  including 
irrelevant  variables.  We  focus  on  the  statistical  properties  of  the  least  squares 
estimator.  When  comparing  the  restricted  and  the  unrestricted  model,  a 
remark  about  the  term  true  model  is  in  order;  see  also  our  earlier  remarks 
in  Section  2.2.1  (p.  87).  Strictly  speaking,  the  term  ‘true’  model  has  a  clear 
interpretation  only  in  the  case  of  simulated  data.  When  the  data  are  from  the 
real  world,  then  the  DGP  is  unknown  and  can  at  best  be  approximated. 
Nevertheless,  it  helps  our  insight  to  study  some  of  the  consequences  of 
estimating  a  different  model  from  the  true  model. 

Omitted  variables  bias 

In  this  section  we  consider  the  consequences  of  omitting  variables  from  the 
‘true  model’.  Suppose  the  ‘true  model’  is 

$$y = X_1\beta_1 + X_2\beta_2 + \varepsilon,$$

but we use the model with only $X_1$ as explanatory variables and with $b_R$ as
our estimator of $\beta_1$. Then we have

$$b_R = (X_1'X_1)^{-1}X_1'y = \beta_1 + (X_1'X_1)^{-1}X_1'X_2\beta_2 + (X_1'X_1)^{-1}X_1'\varepsilon.$$

This shows that

$$E[b_R] = \beta_1 + (X_1'X_1)^{-1}X_1'X_2\beta_2 = \beta_1 + P\beta_2.$$

The last term is sometimes called the omitted variables bias. The estimator $b_R$
is in general a biased estimator of $\beta_1$. In the light of our discussion in Section
3.2.2 we should not be surprised by this ‘bias’, since $b_R$ and $b_1$ have different
interpretations.

Variance  reduction 

To compute the variance of $b_R$, the above two expressions show that
$b_R - E[b_R] = (X_1'X_1)^{-1}X_1'\varepsilon$ so that

$$\mathrm{var}(b_R) = E[(b_R - E[b_R])(b_R - E[b_R])'] = \sigma^2 (X_1'X_1)^{-1}.$$

It is left as an exercise (see Exercise 3.7) to prove that this is smaller than the
variance of the unrestricted least squares estimator $b_1$, that is,
$\mathrm{var}(b_1) - \mathrm{var}(b_R)$ is positive semidefinite.

Summary 

Summarizing, the omission of relevant variables leads to biased estimates but
to a reduction in variance. If one is interested in estimating the ‘direct’ effect
$\beta_1$, then omission of $X_2$ is undesirable unless the resulting bias is small
compared to the gain in efficiency, for instance, when $\beta_2$ is small enough.
This means that variables can be omitted if their effect is small, as this leads
to an improved efficiency of the least squares estimator.
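
The trade-off between bias and variance can be made visible with a small simulation. The sketch below is an illustration under assumed values (sample size 50, $\beta_1 = 1$, $\beta_2 = 0.5$, $\sigma = 1$, no constant term, one regressor in each block); none of these numbers come from the text. It assumes NumPy is available.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 50, 2000
beta1, beta2, sigma = 1.0, 0.5, 1.0

# Fixed regressors; x1 and x2 are correlated, so omitting x2 matters.
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(size=n)
X = np.column_stack([x1, x2])
P = (x1 @ x2) / (x1 @ x1)  # scalar version of P = (X1'X1)^{-1} X1'X2

bR_draws = np.empty(reps)
b1_draws = np.empty(reps)
for r in range(reps):
    y = beta1 * x1 + beta2 * x2 + sigma * rng.normal(size=n)
    bR_draws[r] = (x1 @ y) / (x1 @ x1)                     # restricted: x2 omitted
    b1_draws[r] = np.linalg.lstsq(X, y, rcond=None)[0][0]  # unrestricted

bias_R = bR_draws.mean() - beta1  # simulated bias of bR, to compare with P * beta2
bias_1 = b1_draws.mean() - beta1  # simulated bias of b1, to compare with 0
print(bias_R, P * beta2, bias_1)
print(bR_draws.var(), b1_draws.var())  # simulated variances of bR and b1
```

In this design the restricted estimator is centred away from $\beta_1$ by about $P\beta_2$, while its spread across replications is smaller than that of the unrestricted estimator.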


3.2.4  Consequences  of  redundant  variables 

Redundant  variables  lead  to  inefficiency 

A variable is called redundant if it plays no role in the ‘true’ model. Suppose
that

$$y = X_1\beta_1 + \varepsilon,$$

that is, the DGP satisfies Assumptions 1-6 with $\beta_2 = 0$. In practice we do not
know that $\beta_2$ is zero. Suppose that the variables $X_2$ are included as additional
regressors, so that the estimation results are given by

$$y = X_1b_1 + X_2b_2 + e.$$

Although the estimated model $y = X_1\beta_1 + X_2\beta_2 + \varepsilon$ neglects the fact that
$\beta_2 = 0$, it is not wrongly specified as it satisfies Assumptions 1-6. The result
(3.18) shows that $E[b_1] = \beta_1$ and $E[b_2] = \beta_2 = 0$ in this case. Therefore $b_1$ is
an unbiased estimator. However, this estimator is inefficient in the sense that
$\mathrm{var}(b_1) - \mathrm{var}(b_R)$ is positive semidefinite. That is, if the model contains
redundant variables, then the parameters are estimated with less precision
(larger standard errors) as compared with the model that excludes the
redundant variables. To prove this, we write (3.31) as $b_1 = b_R - Pb_2$. Then
the result

$$\mathrm{var}(b_1) = \mathrm{var}(b_R) + P\,\mathrm{var}(b_2)\,P'$$

follows from the fact that $\mathrm{cov}(b_2, b_R) = 0$, as we will prove below. Because
$\mathrm{var}(b_2)$ is positive definite, it follows that $P\,\mathrm{var}(b_2)\,P'$ is positive semidefinite.
So the variances of the elements of $b_1$ are larger than those of the corresponding
elements of $b_R$, unless the corresponding rows of $P$ are zero. That is,
if $\beta_2 = 0$, then in general we gain efficiency by deleting the irrelevant
variables $X_2$ from the model. This shows the importance of imposing restrictions
on the model.
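
The variance decomposition can be verified directly from the matrix formulas $\mathrm{var}(b) = \sigma^2 (X'X)^{-1}$ and $\mathrm{var}(b_R) = \sigma^2 (X_1'X_1)^{-1}$, without any simulation of $y$. The regressors below are hypothetical (a constant and one variable in $X_1$, one redundant variable in $X_2$), and the sketch assumes NumPy is available.

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma2 = 40, 1.0

# Hypothetical regressors: X1 (constant and one variable), X2 (one redundant variable).
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
X2 = (0.6 * X1[:, 1] + rng.normal(size=n)).reshape(n, 1)
X = np.hstack([X1, X2])

P = np.linalg.solve(X1.T @ X1, X1.T @ X2)    # P = (X1'X1)^{-1} X1'X2
var_full = sigma2 * np.linalg.inv(X.T @ X)   # variance of (b1, b2), unrestricted model
var_b1 = var_full[:2, :2]
var_b2 = var_full[2:, 2:]
var_bR = sigma2 * np.linalg.inv(X1.T @ X1)   # variance of bR, restricted model

diff = var_b1 - var_bR
print(np.allclose(diff, P @ var_b2 @ P.T))   # checks var(b1) = var(bR) + P var(b2) P'
print(np.linalg.eigvalsh(diff))              # eigenvalues of the difference
```

The eigenvalues of the difference are non-negative up to rounding error, which is the numerical counterpart of positive semidefiniteness.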


□ Proof of auxiliary result $\mathrm{cov}(b_2, b_R) = 0$

It remains to prove that $\mathrm{cov}(b_2, b_R) = 0$. The basic step is to express $b_R$ and $b_2$ in
terms of $\varepsilon$. As $\beta_2 = 0$ it follows that

$$b_R = (X_1'X_1)^{-1}X_1'y = (X_1'X_1)^{-1}X_1'(X_1\beta_1 + \varepsilon) = \beta_1 + (X_1'X_1)^{-1}X_1'\varepsilon. \qquad (3.36)$$

To express $b_2$ in terms of $\varepsilon$ we first prove as an auxiliary result that the $g \times g$
matrix $X_2'M_1X_2$ is non-singular. As $X_2'M_1X_2 = (M_1X_2)'M_1X_2$, it suffices to prove
that the $n \times g$ matrix $M_1X_2$ has rank $g$. We use (3.35) and (3.32) to write

$$(X_1 \;\; M_1X_2) = (X_1 \;\; X_2 - X_1P) = (X_1 \;\; X_2)\begin{pmatrix} I & -P \\ 0 & I \end{pmatrix}.$$

The last matrix is non-singular and Assumption 1 states that the $n \times k$ matrix
$(X_1 \;\; X_2)$ has rank $k$. So $(X_1 \;\; M_1X_2)$ also has rank $k$, that is, all its columns are
linearly independent. This means in particular that all columns of the $n \times g$ matrix
$M_1X_2$ are linearly independent, so that this matrix has rank $g$. This proves that
$X_2'M_1X_2$ is non-singular.

Now the result in (3.33) states that $M_1y = M_1X_2b_2 + e$. If we premultiply this
by $X_2'$, then we obtain, as $X_2'e = 0$, that $X_2'M_1y = X_2'M_1X_2b_2$. As $X_2'M_1X_2$ is
non-singular, this means that

$$b_2 = (X_2'M_1X_2)^{-1}X_2'M_1y. \qquad (3.37)$$

We now substitute the ‘true’ model $y = X_1\beta_1 + \varepsilon$ into (3.37). Because $M_1X_1 = 0$,
this gives

$$b_2 = (X_2'M_1X_2)^{-1}X_2'M_1\varepsilon \quad (\text{as } \beta_2 = 0). \qquad (3.38)$$

Using the expressions (3.36) and (3.38) that express $b_R$ and $b_2$ in terms of $\varepsilon$, we
obtain (as $M_1X_1 = 0$)

$$\begin{aligned}
\mathrm{cov}(b_2, b_R) &= E[(b_2 - E[b_2])(b_R - E[b_R])'] \\
&= E[(X_2'M_1X_2)^{-1}X_2'M_1\varepsilon\varepsilon'X_1(X_1'X_1)^{-1}] \\
&= \sigma^2 (X_2'M_1X_2)^{-1}X_2'M_1X_1(X_1'X_1)^{-1} = 0.
\end{aligned}$$

Summary  of  results 

We summarize the results of this and the foregoing section in Exhibit 3.9. If
we include redundant variables ($\beta_2 = 0$) in our model, then this causes a loss
of efficiency of the estimators of the parameters ($\beta_1$) of the relevant variables.
That is, by excluding irrelevant variables we gain efficiency. However, if we
exclude relevant variables ($\beta_2 \neq 0$), this causes a bias in the estimators. So the
choice between a restricted and an unrestricted model involves a trade-off
between the bias and efficiency of estimators. In practice we do not know the
true parameters $\beta_2$ but we can test whether $\beta_2 = 0$. This is discussed in
Sections 3.3 and 3.4.


                        Data Generating Process
Estimated Model         y = X1β1 + X2β2 + ε             y = X1β1 + ε
                        (β2 non-zero)                   (β2 = 0)

y = X1bR + eR           bR biased, but smaller          bR best linear unbiased
                        variance than b1

y = X1b1 + X2b2 + e     b1 unbiased, but larger         b1 unbiased, but not
                        variance than bR                efficient

Exhibit 3.9 Bias and efficiency

Consequences of regression in models that contain redundant variables (bottom right cell) and
in models with omitted variables (top left cell). Comparisons should be made in columns, that
is, for a fixed data generating process. The cells show the statistical properties of the estimators
bR (of the restricted model where X2 is deleted, first row) and b1 (of the unrestricted model that
contains both X1 and X2, second row) for the model parameters under Assumptions 1-6.


Exercises:  T:  3.7d. 


3.2.5  Partial  regression 

Multiple  regression  and  partial  regression 

In this section we give a further interpretation of the least squares estimates
in a multiple regression model. In Section 3.2.2 we mentioned that these
estimates measure direct effects under the ceteris paribus condition that the
other variables are kept fixed. As such ‘controlled experiments’ are almost
never possible in economics, the question arises what the precise interpretation
of this condition is. As will be shown below, it means that the indirect
effects that are caused by variations in the other variables are automatically
removed in a multiple regression. We consider again the model where the
$n \times k$ matrix of explanatory variables $X$ is split in two parts as $X = (X_1 \;\; X_2)$,
where $X_1$ is an $n \times (k - g)$ matrix and $X_2$ an $n \times g$ matrix. The regression of
$y$ on $X_1$ and $X_2$ gives the result

$$y = X_1b_1 + X_2b_2 + e.$$

Another approach to estimate the effects of $X_1$ on $y$ is the following two-step
method, called partial regression.


Partial  regression 

• Step 1: Remove the effects of $X_2$. Here we remove the side effects that are
caused by $X_2$. That is, regress $y$ on $X_2$ with residuals $M_2y$, where
$M_2 = I - X_2(X_2'X_2)^{-1}X_2'$. Also regress each column of $X_1$ on $X_2$ with
residuals $M_2X_1$. Here $M_2y$ and $M_2X_1$ can be interpreted as the ‘cleaned’
variables obtained after removing the effects of $X_2$. Note that, as a
consequence of the fact that residuals are orthogonal to explanatory variables,
the ‘cleaned’ variables $M_2y$ and $M_2X_1$ are uncorrelated with $X_2$.

• Step 2: Estimate the ‘cleaned’ effect of $X_1$ on $y$. Now estimate the ‘cleaned’
effect of $X_1$ on $y$ by regressing $M_2y$ on $M_2X_1$. This gives

$$M_2y = M_2X_1b_* + e_*,$$

where $b_* = [(M_2X_1)'M_2X_1]^{-1}(M_2X_1)'M_2y = (X_1'M_2X_1)^{-1}X_1'M_2y$ and
$e_* = M_2y - M_2X_1b_*$ are the corresponding residuals.


The  result  of  Frisch-Waugh 

The result of Frisch-Waugh states that

$$b_* = b_1, \qquad e_* = e. \qquad (3.39)$$

That is, by including $X_2$ in the regression model, the estimated effect
$b_1$ of $X_1$ on $y$ is automatically ‘cleaned’ from the side effects caused by $X_2$.


□ Proof of the result of Frisch-Waugh

To prove the result of Frisch-Waugh, we write out the normal equations
$X'Xb = X'y$ in terms of the partitioned matrix $X = (X_1 \;\; X_2)$.

$$X_1'X_1b_1 + X_1'X_2b_2 = X_1'y \qquad (3.40)$$

$$X_2'X_1b_1 + X_2'X_2b_2 = X_2'y \qquad (3.41)$$

From (3.41) we get $b_2 = (X_2'X_2)^{-1}X_2'y - (X_2'X_2)^{-1}X_2'X_1b_1$, and by substituting
this in (3.40) and arranging terms it follows that

$$X_1'M_2X_1b_1 = X_1'M_2y,$$

where $M_2 = I - X_2(X_2'X_2)^{-1}X_2'$. In Section 3.2.4 we proved that $X_2'M_1X_2$ is
invertible, and a similar argument shows that also $X_1'M_2X_1$ is invertible. This
shows that $b_* = b_1$. Further it follows from (3.30) and the facts that $M_2X_2 = 0$
and $M_2e = e$ that

$$M_2y = M_2X_1b_1 + e.$$

As $b_1 = b_*$ this shows that $e_* = e$.
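
The two-step partial regression and the result of Frisch-Waugh can be checked on simulated data. In the sketch below the data, the coefficient values, and the helper `ols` are hypothetical; the matrix `M2` follows the definition $M_2 = I - X_2(X_2'X_2)^{-1}X_2'$ used above. It assumes NumPy is available.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100

# Hypothetical data: X2 holds a constant and one variable, X1 holds two variables.
X2 = np.column_stack([np.ones(n), rng.normal(size=n)])
X1 = np.column_stack([0.5 * X2[:, 1] + rng.normal(size=n), rng.normal(size=n)])
y = X1 @ np.array([1.0, -2.0]) + X2 @ np.array([0.3, 0.7]) + rng.normal(size=n)

def ols(X, y):
    """Return least squares coefficients and residuals of y on X."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    return b, y - X @ b

# Multiple regression of y on X1 and X2.
b_full, e = ols(np.hstack([X1, X2]), y)
b1 = b_full[:2]

# Step 1: M2 produces the residuals of a regression on X2.
M2 = np.eye(n) - X2 @ np.linalg.solve(X2.T @ X2, X2.T)
# Step 2: regress the cleaned y on the cleaned X1.
b_star, e_star = ols(M2 @ X1, M2 @ y)

print(np.allclose(b_star, b1), np.allclose(e_star, e))
```

Forming the $n \times n$ matrix `M2` explicitly is convenient for a small illustration; for large $n$ one would compute the residuals column by column instead.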

Summary: To estimate the effect of $X_1$ on $y$, should we
include $X_2$ or not?

Suppose that we wish to estimate the effect of a certain set of regressors ($X_1$)
on the dependent variable ($y$). The question is whether certain other variables
($X_2$) should be added to or omitted from the regression. If the two sets of
regressors $X_1$ and $X_2$ are related (in the sense that $X_1'X_2 \neq 0$), then the
estimated effects $X_1 \to y$ differ in the two models. The partial effect
$X_1 \to y$ (ceteris paribus, as if $X_2$ were fixed) cannot be determined if $X_2$ is
deleted from the model, because then the indirect effect $X_1 \to X_2 \to y$ is also
present.

To isolate the direct effect $X_1 \to y$ one can first remove the effects of $X_2$ on
$y$ and of $X_2$ on $X_1$, after which the cleaned $M_2y$ is regressed on the cleaned
$M_2X_1$. Instead of this partial regression, one can also include $X_2$ as additional
regressors in the model and regress $y$ on $X_1$ and $X_2$.

On the other hand, if one is interested in the total effect of $X_1$ on $y$, then $X_2$
should be deleted from the model.

Three  illustrations 

There  are  several  interesting  applications  of  the  result  of  Frisch-Waugh,  and 
we  mention  three  of  them. 
Case 1: Deviations from sample mean

Let $X_2$ have only one column consisting of ones. If we premultiply by $M_2$, this
amounts to taking deviations from means. For instance, regressing $y$ on $X_2$
gives an estimated coefficient $(X_2'X_2)^{-1}X_2'y = \frac{1}{n}\sum_{i=1}^{n} y_i = \bar{y}$ with residuals
$y_i - \bar{y}$, so that the elements of $M_2y$ are $(y_1 - \bar{y}), \ldots, (y_n - \bar{y})$. The result of
Frisch-Waugh states that inclusion of a constant term gives the same results
as a regression where all variables are expressed in deviation from their
means. In fact we have already met this kind of formula in Chapter 2, for
instance, in formula (2.8) for the least squares slope estimator.


Case  2:  Detrending 

Let $X_2$ consist of two columns, a constant and a trend, as follows.

$$X_2 = \begin{pmatrix} 1 & 1 \\ 1 & 2 \\ \vdots & \vdots \\ 1 & n \end{pmatrix}$$

Then the first step in partial regression amounts to removing the (linear)
trends from $y$ and the columns of $X_1$. This case was the subject of the article
by R. Frisch and F. V. Waugh, ‘Partial Time Regressions as Compared with
Individual Trends’, Econometrica, 1 (1933), 387-401.
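
A short numerical sketch of the detrending case follows. The trending series, the coefficient values, and the helper `residuals` are hypothetical; the point is only that regressing detrended $y$ on detrended $x$ gives the same slope as including the constant and the trend as regressors. It assumes NumPy is available.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 60
t = np.arange(1, n + 1)

# Hypothetical trending series x and y.
x = 0.05 * t + rng.normal(size=n)
y = 2.0 + 0.1 * t + 1.5 * x + rng.normal(size=n)

X2 = np.column_stack([np.ones(n), t])  # constant and linear trend

def residuals(X, v):
    """Residuals of the least squares regression of v on X."""
    return v - X @ np.linalg.lstsq(X, v, rcond=None)[0]

# Step 1: remove the linear trend from y and x. Step 2: regress one on the other.
y_dt, x_dt = residuals(X2, y), residuals(X2, x)
slope_detrended = (x_dt @ y_dt) / (x_dt @ x_dt)

# Alternative: include the constant and the trend as additional regressors.
slope_with_trend = np.linalg.lstsq(np.column_stack([x, X2]), y, rcond=None)[0][0]
print(slope_detrended, slope_with_trend)
```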


Case  3:  Single  partial  relation 

Let $X_1$ consist of a single variable, so that $k - g = 1$ and $X_2$ contains the
remaining $k - 1$ variables. Then both $M_2X_1$ and $M_2y$ have one column and
one can visualize the relation between these columns by drawing a scatter
plot. This is called a partial regression scatter plot. The slope of the regression
line in this plot is $b_1$. This equals the slope parameter of $X_1$ in the
multiple regression equation (3.30).


Example  3.3:  Bank  Wages  (continued) 

Using the data on bank wages of the illustration in Section 3.1.7, we illustrate
some of the foregoing results for the model

$$y_i = \beta_1 + \beta_2 x_{2i} + \beta_3 x_{3i} + \varepsilon_i,$$

where $y_i$ denotes the logarithm of yearly salary, $x_{2i}$ the education, and $x_{3i}$ the
logarithm of the begin salary of the $i$th employee. This is the regression of
the illustration in Section 3.1.7, with results in Exhibit 3.5, Panel 3. We will
now consider (i) the above-mentioned Case 1, and (ii) the above-mentioned
Case 3.
Regression 1: Dependent Variable: LOGSAL

Variable            Coefficient    Std. Error
C                   10.35679       0.018250

Regression 2: Dependent Variable: EDUC

Variable            Coefficient    Std. Error
C                   13.49156       0.132505

Regression 3: Dependent Variable: LOGSALBEGIN

Variable            Coefficient    Std. Error
C                   9.669405       0.016207

Regression 4: Dependent Variable: DMLOGSAL

Variable            Coefficient    Std. Error
DMEDUC              0.023122       0.003890
DMLOGSALBEGIN       0.868505       0.031801

R-squared           0.800579
Adjusted R-squared  0.800157
S.E. of regression  0.177624
Sum squared resid   14.89166

Regression 5: Dependent Variable: LOGSAL

Variable            Coefficient    Std. Error
C                   0.705383       0.232198
LOGSALBEGIN         0.998139       0.023998

Regression 6: Dependent Variable: EDUC

Variable            Coefficient    Std. Error
C                   -40.71973      2.650406
LOGSALBEGIN         5.606476       0.273920

Regression 7: Dependent Variable: RESLOGSAL

Variable            Coefficient    Std. Error
RESEDUC             0.023122       0.003885

R-squared           0.069658
Adjusted R-squared  0.069658
S.E. of regression  0.177436
Sum squared resid   14.89166

Exhibit 3.10 Bank Wages (Example 3.3)

Two illustrations of partial regressions. Regressions 1-3 determine the effect of the constant
term on the variables LOGSAL, EDUC, and LOGSALBEGIN. The residuals of these regressions
(which correspond to taking the original observations in deviation from their sample
mean and which are denoted by DM) are related in Regression 4. Regressions 5 and 6
determine the effect of LOGSALBEGIN on LOGSAL and EDUC. The residuals of these two
regressions (which correspond to the variables LOGSAL and EDUC where the effect of
LOGSALBEGIN has been eliminated and which are denoted by RESLOGSAL and RESEDUC)
are related in Regression 7.
(i) Deviation from mean

We first consider Case 1 above, where the variables are expressed in deviations
from their sample mean. In the first step all variables are regressed on a
constant, and then the demeaned $y$ is regressed on the two demeaned
variables $x_2$ and $x_3$. This is shown in Regressions 1-4 in Exhibit 3.10. If
we compare the results of Regression 4 in Exhibit 3.10 with those of the
unrestricted regression in Exhibit 3.5, Panel 3, we see that the regression
coefficients are equal. However, there is a small difference in the calculated
standard errors (see Exercise 3.9).

(ii)  Direct  effect  of  education  on  salary 

Next we consider Case 3 above and give a partial regression interpretation of
the coefficient $b_2 = 0.023$ in Exhibit 3.5, Panel 3, for the estimated ‘direct
effect’ of education on salary for ‘fixed’ begin salary. This is shown in
Regressions 5-7 in Exhibit 3.10. In terms of the model $y = X_1\beta_1 + X_2\beta_2 + \varepsilon$
in (3.26), let $X_2$ be the $474 \times 2$ matrix with a column of ones
and with the values of $x_3$ (log begin salary) in column 2, and let $X_1$ be the
$474 \times 1$ vector containing the values of $x_2$ (education). To remove the effects
of the other variables, $y$ and $X_1$ are first regressed on $X_2$ with residuals $M_2y$
and $M_2X_1$, and in the second step $M_2y$ is regressed on $M_2X_1$. The last
regression corresponds to the model $M_2y = (M_2X_1)b_* + e_*$ in the result
of Frisch-Waugh. This result states that the estimated coefficient in this
regression is equal to the coefficient in the multiple regression model, which
is verified by comparing Regression 7 in Exhibit 3.10 with the result in
Exhibit 3.5, Panel 3. The corresponding partial regression scatter plot is
shown in Exhibit 3.11, where RESLOGSAL denotes $M_2y$ and RESEDUC
denotes $M_2X_1$.

[Scatter plot ‘RESLOGSAL vs. RESEDUC’: RESLOGSAL on the vertical axis against
RESEDUC on the horizontal axis, with regression line.]

Exhibit 3.11 Bank Wages (Example 3.3)

Partial regression scatter plot of (logarithmic) salary against education, with regression
line. On the vertical axis are the residuals of the regression of log salary on a constant and
log begin salary and on the horizontal axis are the residuals of the regression of education
on a constant and log begin salary. The slope of the regression line in the figure indicates
the direct effect of education on log salary after neutralizing for the indirect effect via log
begin salary.
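
The small differences between the standard errors in Exhibit 3.10 and those of the full regression can be reproduced by simple arithmetic. The sketch below rests on an interpretation that is not spelled out in the text: each auxiliary regression is assumed to divide the same sum of squared residuals by its own degrees of freedom ($474 - 2$ in Regression 4 and $474 - 1$ in Regression 7, instead of $474 - 3$). The helper `rescale` is hypothetical, and the input values are the rounded estimates of the full regression as reported in Panel 1 of Exhibit 3.12.

```python
import math

n = 474
# Full regression with k = 3: S.E. of regression and standard error of EDUC.
s_full, se_educ_full = 0.177812, 0.003894

def rescale(value, df_full, df_aux):
    """Rescale a standard error from df_full to df_aux degrees of freedom."""
    return value * math.sqrt(df_full / df_aux)

# Regression 4 (two demeaned regressors, no constant): n - 2 degrees of freedom.
s_reg4 = rescale(s_full, n - 3, n - 2)
se_educ_reg4 = rescale(se_educ_full, n - 3, n - 2)

# Regression 7 (one regressor, no constant): n - 1 degrees of freedom.
s_reg7 = rescale(s_full, n - 3, n - 1)
se_educ_reg7 = rescale(se_educ_full, n - 3, n - 1)

print(round(s_reg4, 6), round(se_educ_reg4, 6))
print(round(s_reg7, 6), round(se_educ_reg7, 6))
```

Up to rounding of the inputs, the rescaled values match the figures reported for Regressions 4 and 7 in Exhibit 3.10.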

Exercises:  T:  3.9;  E:  3.16,  3.18. 
3.3 The accuracy of estimates


3.3.1 The t-test

Uses  Sections  1.2.3,  1.4.1,  1.4.2. 


Test  of  significance 

To test whether we should include a variable in the model or not, we can test
its statistical significance. To check whether the $j$th explanatory variable has
a significant effect on $y$, we test the null hypothesis $H_0: \beta_j = 0$ against the
alternative $H_1: \beta_j \neq 0$.


Derivation  of  t-test 

For this purpose we suppose that Assumptions 1-7 hold true. As the least squares
estimator $b$ is a linear function of $\varepsilon$, it follows that, under these assumptions, $b$
is normally distributed. Its mean and variance are given by (3.18) and (3.19),
so that

$$b \sim N(\beta, \sigma^2 (X'X)^{-1}). \qquad (3.42)$$

The variance of the $j$th component $b_j$ of the least squares estimator $b$ is equal
to $\sigma^2 a_{jj}$, where $a_{jj}$ is the $j$th diagonal element of $(X'X)^{-1}$. By standardization
we get

$$\frac{b_j - \beta_j}{\sigma\sqrt{a_{jj}}} \sim N(0, 1).$$

This expression cannot be used to test whether $\beta_j = 0$, as the variance $\sigma^2$ is
unknown. Therefore $\sigma$ is replaced by $s$, where $s^2$ is the unbiased estimator of $\sigma^2$
defined in (3.22).

To derive the distribution of the resulting test statistic we use the following
results of Section 1.2.3 (p. 32 and 34-5). Let $w \sim N(0, I)$ be an $n \times 1$ vector of
independent $N(0, 1)$ variables, and let $A$ be a given $m \times n$ matrix and $Q$ a given
$n \times n$ symmetric and idempotent matrix. Then $Aw \sim N(0, AA')$ and
$w'Qw \sim \chi^2(r)$ where $r = \mathrm{tr}(Q)$, and these two random variables are
independently distributed when $AQ = 0$. We apply these results with $w = (1/\sigma)\varepsilon$,
$A = (X'X)^{-1}X'$, and $Q = M = I - X(X'X)^{-1}X'$ with $\mathrm{tr}(M) = n - k$. Note that
$b - \beta = (X'X)^{-1}X'\varepsilon = A\varepsilon$ so that $(b - \beta)/\sigma = Aw$, and that $e = My =
M(X\beta + \varepsilon) = M\varepsilon$ so that $e'e/\sigma^2 = w'Mw$. As $AM = 0$, it follows that $b$ and $e'e$ are
independently distributed. Further, $e'e/\sigma^2 \sim \chi^2(n - k)$ and $(b_j - \beta_j)/(\sigma\sqrt{a_{jj}})
\sim N(0, 1)$. As both terms are independent, the quotient of the second term and
the square root of the first term divided by its degrees of freedom $(n - k)$ has
by definition the Student t-distribution with $(n - k)$ degrees of freedom.


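The t-test

Let $s_j = s\sqrt{a_{jj}}$ be the standard error of $b_j$; then

$$t_j = \frac{b_j - \beta_j}{s_j} = \frac{b_j - \beta_j}{s\sqrt{a_{jj}}} = \frac{(b_j - \beta_j)/(\sigma\sqrt{a_{jj}})}{\sqrt{\dfrac{e'e/\sigma^2}{n - k}}} \sim t(n - k), \qquad (3.43)$$

that is, $t_j$ follows the $t(n - k)$ distribution.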


The  t-value  and  significance 

To test whether $x_j$ has no effect on $y$, which corresponds to $\beta_j = 0$, we use the
above test statistic with $\beta_j = 0$. That is, to test the null hypothesis that $\beta_j = 0$
against the alternative that $\beta_j \neq 0$, we compute the t-value

$$t_j = \frac{b_j}{s_j} = \frac{b_j}{s\sqrt{a_{jj}}}. \qquad (3.44)$$

We reject the null hypothesis if $t_j$ differs significantly from zero. If the null
hypothesis $\beta_j = 0$ is true, $t_j$ follows the $t(n - k)$ distribution. Against the
above two-sided alternative, we reject the null hypothesis if $|t_j| > c$, where the
critical value $c$ is determined by the chosen significance level $\alpha$ through
$P[|t| > c] = \alpha$ with $t \sim t(n - k)$. This is called
the t-test, or the test of (individual) significance of $b_j$.


Use of the t-test and the P-value

As discussed in Section 2.3.1 (p. 100), for a size of 5 per cent we can use $c = 2$
as a rule of thumb, which is accurate if $n - k$ is not very small (say
$n - k > 30$). In general it is preferable to report the P-value of the test. Of
course, if we want to establish an effect of $x_j$ on $y$, then we hope to be able to
reject the null hypothesis. However, we should do this only if there exists
sufficient evidence for this effect. That is, in this case the size of the test
should be chosen small enough to protect ourselves from a large probability
of an error of the first type. Stated otherwise, the null hypothesis is rejected
only for small enough P-values of the test. For a significance level of 5 per cent,
the null hypothesis is rejected for $P < 0.05$ and it is not rejected for $P > 0.05$.
In some situations smaller significance levels are used (especially in large
samples), and in other situations sometimes larger significance levels are used
(for instance in small samples).

Summary  of  computations 

In  regression  we  usually  compute 

• the regression coefficients $b = (X'X)^{-1}X'y$,

• the standard error of regression $s$,

• for each of the coefficients $b_j$, $j = 1, \ldots, k$, their standard error $s_j = s\sqrt{a_{jj}}$,
their t-value $t_j = b_j/s_j$, and their P-value $P_j = P[|t| > |t_j|]$ where $t$ has the
$t(n - k)$-distribution.

Of  particular  interest  are 

•  the  significance  of  the  regressors  (measured  by  the  P-values), 

•  the  sign  of  significant  coefficients  (indicating  whether  the  corresponding 
regressor  has  a  positive  or  a  negative  effect  on  y), 

•  the  size  of  the  coefficients  (which  can  only  be  judged  properly  in  combin¬ 
ation  with  the  measurement  scale  of  the  corresponding  regressor). 

Other statistics like $R^2$ may also be of interest, as well as other statistics that
will be discussed later in the book.
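
The computations summarized above can be sketched in a few lines. The function `ols_summary` and the simulated data are hypothetical, and the sketch assumes NumPy is available. To keep it free of further dependencies, the P-values use a standard normal approximation to the $t(n - k)$ distribution, which is an assumption that is reasonable only when $n - k$ is large; exact P-values require the cumulative distribution function of the t-distribution.

```python
import math
import numpy as np

def ols_summary(X, y):
    """Coefficients, standard error of regression, standard errors,
    t-values and (normal-approximation) two-sided P-values."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y                # b = (X'X)^{-1} X'y
    e = y - X @ b                        # residuals
    s = math.sqrt(e @ e / (n - k))       # standard error of regression
    se = s * np.sqrt(np.diag(XtX_inv))   # s_j = s * sqrt(a_jj)
    t = b / se                           # t-values for H0: beta_j = 0
    # Approximation: two-sided tail probability of N(0, 1) instead of t(n - k).
    p_approx = np.array([math.erfc(abs(tj) / math.sqrt(2.0)) for tj in t])
    return b, s, se, t, p_approx

# Hypothetical simulated data; the third regressor has true coefficient 0.
rng = np.random.default_rng(5)
n = 400
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=n)

b, s, se, t, p = ols_summary(X, y)
for j in range(X.shape[1]):
    print(f"b{j + 1} = {b[j]: .4f}  s.e. = {se[j]:.4f}  t = {t[j]: .2f}  P ~ {p[j]:.4f}")
```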


3.3.2  Illustration:  Bank  Wages 

We consider again the salary data and the linear model with $k = 3$ explanatory
variables (a constant, education, and the logarithm of begin salary)
discussed in Example 3.3. We will discuss (i) the regression outcomes and
t-tests, (ii) presentation of the regression results, and (iii) results of the model
with two additional regressors (gender and minority).

(i)  Regression  outcomes  and  t-tests 

Panel  1  in  Exhibit  3.12  shows  the  outcomes  of  regressing  salary  (in  loga¬ 
rithms)  on  a  constant  and  the  explanatory  variables  education  and  begin 
salary  (the  last  again  in  logarithms).  The  column  ‘Coefficient’  contains  the 
regression  coefficients  bj,  the  column  ‘Std.  Error’  the  standard  errors  Sj,  and 
the  column  ‘t-Statistic’  the  t-values  tj  —  bj/sj.  The  column  denoted  by  ‘Prob’ 
contains  the  P-values  corresponding  to  the  t-values  in  the  preceding 
column  —  that  is,  the  P-value  of  the  hypothesis  that  bj  =  0  against  the  two- 
sided  alternative  that  bj  ^  0-  In  this  example  with  n  =  474  and  k  =  3,  if  t 
follows  the  t( 471)  distribution  and  c  is  the  outcome  of  the  t-statistic,  then  the 
P-value  is  defined  as  the  (two-sided)  probability  P(\t\  >  |c|).  The  P-value 
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Panel  1:  Dependent  Variable:  LOGSAL 

Method:  Least  Squares 

Sample:  1  474 

Included  observations:  474 

Variable 

Coefficient 

Std.  Error 

t- Statistic 

Prob. 

C 

1.646916 

0.274598 

5.997550 

0.0000 

EDUC 

0.023122 

0.003894 

5.938464 

0.0000 

LOGSALBEGIN 

0.868505 

0.031835 

27.28174 

0.0000 

R-squared 

0.800579 

Mean  dependent  var 

10.35679 

Adjusted  R-squared 

0.799733 

S.D.  dependent  var 

0.397334 

S.E.  of  regression 

0.177812 

Sum  squared  resid 

14.89166 

Panel  2:  Dependent  Variable:  LOGSAL 

Method:  Least  Squares 

Sample:  1  474 

Included  observations:  474 

Variable 

Coefficient 

Std.  Error 

t-Statistic 

Prob. 

C 

2.079647 

0.314798 

6.606288 

0.0000 

EDUC 

0.023268 

0.003870 

6.013129 

0.0000 

LOGSALBEGIN 

0.821799 

0.036031 

22.80783 

0.0000 

GENDER 

0.048156 

0.019910 

2.418627 

0.0160 

MINORITY 

-0.042369 

0.020342 

-2.082842 

0.0378 

R-squared 

0.804117 

Mean  dependent  var 

10.35679 

Adjusted  R-squared 

0.802446 

S.D.  dependent  var 

0.397334 

S.E.  of  regression 

0.176603 

Sum  squared  resid 

14.62750 

Exhibit  3.12  Bank  Wages  (Section  3.3.2) 

Results  of  two  regressions.  Panel  1  shows  the  regression  of  salary  (in  logarithms)  on  education 
and  begin  salary  (in  logarithms)  and  Panel  2  shows  the  results  when  gender  and  minority  are 
included  as  additional  explanatory  variables.  The  column  ‘Prob’  contains  the  P-values  for  the 
null  hypothesis  that  the  corresponding  parameter  is  zero  against  the  two-sided  alternative  that 
it  is  non-zero. 

requires Assumptions 1-7 and, in addition, that the null hypothesis $\beta_j = 0$ is
true. All parameters are highly significant.
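
The t-values and P-values of Exhibit 3.12 can be reproduced from the reported coefficients and standard errors. The following is a minimal sketch under our own assumptions: the function names are illustrative, and the two-sided probability $P(|t| > |c|)$ is approximated by integrating the t density numerically (Simpson's rule) instead of calling a statistical package.

```python
import math

def t_pdf(x, df):
    """Density of the Student t distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))

def two_sided_p_value(c, df, steps=2000):
    """P(|t| > |c|) for t ~ t(df), using Simpson's rule on [0, |c|]."""
    upper = abs(c)
    if upper == 0.0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_mass = 2.0 * total * h / 3.0  # P(-|c| <= t <= |c|)
    return max(0.0, 1.0 - central_mass)

# Coefficients and standard errors from Panel 1 of Exhibit 3.12 (n = 474, k = 3).
panel1 = {
    "C": (1.646916, 0.274598),
    "EDUC": (0.023122, 0.003894),
    "LOGSALBEGIN": (0.868505, 0.031835),
}
df = 474 - 3
t_values = {name: b / s for name, (b, s) in panel1.items()}
p_values = {name: two_sided_p_value(t, df) for name, t in t_values.items()}

for name in panel1:
    print(f"{name:12s} t = {t_values[name]:8.3f}  P = {p_values[name]:.4f}")
```

Small differences from the printed t-values arise because the inputs are rounded. Applied to the t-values of GENDER and MINORITY in Panel 2 (with 474 − 5 = 469 degrees of freedom), the same function gives P-values close to the reported 0.0160 and 0.0378.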

(ii)  Presentation  of  regression  results 

There  are  several  conventions  to  present  regression  results  in  the  form  of  an 
equation.  For  example,  similar  to  what  was  done  in  Example  2.9  (p.  102), 
the  parameter  estimates  can  be  reported  together  with  their  t-values  (in 
parentheses)  in  the  form 

$$ y = \underset{(5.998)}{1.647} + \underset{(5.938)}{0.023}\,x_2 + \underset{(27.282)}{0.869}\,x_3 + e. $$

Sometimes  the  parameter  estimates  are  reported  together  with  their  standard 
errors.  Many  readers  are  interested  in  the  question  whether  the  estimates  are 
significantly different from zero. These readers almost automatically start to
calculate the t-values themselves, so it is helpful to present the t-values
right away. In some cases, however, the null hypothesis of interest is that a
parameter equals some value other than zero. In such a case the reported t-values
give the wrong answers and extra calculations are required. These calculations are
simpler if standard errors are presented. Those who prefer interval estimates are
also better served by reporting standard errors. The obvious way out seems to be to
report both the t-values and the standard errors, but this requires more reporting
space. In any case, one should always clearly mention which convention is
followed.
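
The following sketch illustrates why standard errors are convenient when the hypothesized value is not zero. The null hypothesis $\beta_3 = 1$ is our own hypothetical choice (it is not tested in the text), and the value 1.96 is the normal approximation to the two-sided 5 per cent critical value of the t(471) distribution.

```python
# Estimate and standard error of LOGSALBEGIN from Panel 1 of Exhibit 3.12.
b3, s3 = 0.868505, 0.031835

# The reported t-value tests beta3 = 0.
t_zero = b3 / s3

# Hypothetical null hypothesis beta3 = 1 (chosen here only for illustration).
beta3_null = 1.0
t_one = (b3 - beta3_null) / s3

# Approximate 95% interval; 1.96 is the normal approximation to the
# two-sided 5% critical value of the t(471) distribution.
crit = 1.96
interval = (b3 - crit * s3, b3 + crit * s3)

print(f"t for beta3 = 0: {t_zero:.3f}")
print(f"t for beta3 = 1: {t_one:.3f}")
print(f"approximate 95% interval: ({interval[0]:.3f}, {interval[1]:.3f})")
```

With only the t-value 27.282 reported, the reader would first have to recover the standard error as 0.868505/27.282 before either calculation could be made.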

(iii)  Two  additional  regressors 

As compared with the illustration in Section 3.1.7, we now extend the set of
explanatory variables with $x_4$ (gender) and $x_5$ (minority). Panel 2 of Exhibit
3.12 shows the regression outcomes when these variables are added. On the
basis of the t-test, both the variable $x_4$ and the variable $x_5$ have significant
effects (at 5 per cent significance level).

Note that, if we add variables, the coefficients of the other variables also
change. This is because the explanatory variables are correlated with each
other; that is, in the notation of Section 3.2.1 we have $X_1'X_2 \neq 0$ (see
(3.31) and (3.32)). For instance, the additional regressor gender is correlated
with the regressors education and begin salary, with correlation coefficients
0.36 and 0.55 respectively. In the notation of the result of Frisch-Waugh,
to reproduce the estimate $b_1$ of the larger model we should not simply regress
$y$ on $X_1$ (as in Panel 1 of Exhibit 3.12), but instead we should regress $M_2y$ on
$M_2X_1$. If important variables like $x_4$ and $x_5$ are omitted from the model, this
may lead to biased estimates of direct effects, as was discussed in Section 3.2.3.
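
The effect of omitting a correlated regressor, and the remedy suggested by the result of Frisch-Waugh, can be checked numerically. The sketch below uses made-up data (not the bank wage data) and two small helper functions, `solve` and `ols`, that are our own illustrative names. Here the regressor of interest is `x1`, and the 'other' regressors that are partialled out are the constant and `x2`.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def ols(columns, y):
    """Least squares coefficients and residuals; `columns` is a list of regressors."""
    k = len(columns)
    xtx = [[sum(u * v for u, v in zip(columns[i], columns[j])) for j in range(k)]
           for i in range(k)]
    xty = [sum(u * v for u, v in zip(columns[i], y)) for i in range(k)]
    b = solve(xtx, xty)
    fitted = [sum(b[j] * columns[j][t] for j in range(k)) for t in range(len(y))]
    resid = [yt - ft for yt, ft in zip(y, fitted)]
    return b, resid

# Made-up data: x1 is the regressor of interest, x2 is correlated with x1.
const = [1.0] * 8
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [0.0, 1.0, 1.0, 2.0, 4.0, 3.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 7.1, 6.9, 9.2, 10.1]

b_full, _ = ols([const, x1, x2], y)   # y on constant, x1 and x2
b_short, _ = ols([const, x1], y)      # x2 omitted

# Frisch-Waugh: partial the other regressors (constant and x2) out of y and x1,
# then regress the residuals on each other.
_, y_res = ols([const, x2], y)
_, x1_res = ols([const, x2], x1)
b_fw, _ = ols([x1_res], y_res)

print(f"coefficient of x1, full model    : {b_full[1]:.4f}")
print(f"coefficient of x1, x2 omitted    : {b_short[1]:.4f}")
print(f"coefficient of x1, partialled out: {b_fw[0]:.4f}")
```

The coefficient of `x1` changes markedly when `x2` is omitted, whereas the regression of the partialled-out variables reproduces the coefficient of the full model.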


3.3.3  Multicollinearity 

Factors  that  affect  significance 

It may happen that $\beta_j \neq 0$ but that the t-test cannot reject the hypothesis that
$\beta_j = 0$. The estimate $b_j$ is then not accurate enough; that is, its standard
error is too large. In this case the t-test does not have enough power to reject
the null hypothesis. To analyse the possible causes of such a situation we
decompose the variance of the least squares estimators in terms of a number
of components. We will derive the result in three steps, first for the mean,
then for the simple regression model, and finally for the multiple regression
model.

First case: Sample mean

We start with the simplest possible example of a matrix X that consists of one
column of unit elements. In this case we have $b = \bar{y}$ and

$$ \mathrm{var}(b) = \frac{\sigma^2}{n}. $$

We see that for a given required accuracy there is a trade-off between $\sigma^2$ and
n. If the disturbance variance $\sigma^2$ is large (that is, if there is much random
variation in the outcomes of y) then we need a large sample size to obtain a
precise estimate b.

Second  case:  Simple  regression 

Next we consider the simple regression model studied in Chapter 2,

$$ y_i = \alpha + \beta x_i + \varepsilon_i. $$

For the least squares estimator b discussed there, the variance is given by

$$ \mathrm{var}(b) = \frac{\sigma^2}{\sum (x_i - \bar{x})^2} = \frac{\sigma^2}{(n-1)s_x^2}. \tag{3.45} $$

Here we use the expression

$$ s_x^2 = \frac{\sum (x_i - \bar{x})^2}{n-1} $$

for the sample variance of x. For a given required accuracy we now see a
trade-off between three factors: a large disturbance variance $\sigma^2$ can be compensated for
either by a large sample size n or by a large variance $s_x^2$ of the explanatory
variable. More variation in the disturbances $\varepsilon_i$ gives a smaller accuracy of the
estimators, whereas more observations and more variation in the regressor $x_i$
lead to a higher accuracy.
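
The trade-off in (3.45) can be made concrete with a few lines of code. This is only a sketch: the function name `slope_variance`, the two sets of x-values, and the disturbance variance are assumptions chosen for illustration.

```python
def slope_variance(sigma2, x):
    """var(b) = sigma^2 / sum (x_i - xbar)^2 = sigma^2 / ((n - 1) s_x^2), see (3.45)."""
    n = len(x)
    xbar = sum(x) / n
    s2_x = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
    return sigma2 / ((n - 1) * s2_x)

x_narrow = [4.0, 5.0, 6.0, 5.0, 4.0, 6.0]    # little variation in x
x_wide = [0.0, 5.0, 10.0, 5.0, 0.0, 10.0]    # same n, more variation in x
sigma2 = 2.0                                  # assumed disturbance variance

var_narrow = slope_variance(sigma2, x_narrow)
var_wide = slope_variance(sigma2, x_wide)
var_wide_double_n = slope_variance(sigma2, x_wide * 2)  # same design observed twice

print(var_narrow, var_wide, var_wide_double_n)
```

Spreading out the regressor reduces the variance of b, and repeating the same design (so that n doubles while the spread of x is essentially maintained) halves it again.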


General  case:  Multiple  regression  (derivation) 

Finally we look at the general multiple regression model. We concentrate on one
regression coefficient and without loss of generality we choose the last one, since it
is always possible to change the order of the columns of X. We use the notation
introduced in Section 3.2. In the current situation g = 1, so that the n × g matrix
$X_2$ reduces to an n × 1 vector that we will denote by $x_2$. The n × (k − 1)
matrix $X_1$ corresponds to the first (k − 1) regressors. We concentrate on the single
parameter $\beta_2$ in the model

$$ y = X_1\beta_1 + X_2\beta_2 + \varepsilon = X_1\beta_1 + \beta_2 x_2 + \varepsilon. $$

Substituting this in (3.37) and using $M_1X_1 = 0$, it follows that
$b_2 = \beta_2 + (x_2'M_1x_2)^{-1}x_2'M_1\varepsilon$, so that

$$ \mathrm{var}(b_2) = \sigma^2 (x_2'M_1x_2)^{-1}. \tag{3.46} $$

Here $M_1x_2$ has one column and $x_2'M_1x_2$ is the residual sum of squares of the
auxiliary regression $x_2 = X_1P + M_1x_2$ (see (3.35)). As $R^2 = 1 - (SSR/SST)$, we
may write

$$ SSR = SST(1 - R^2). $$

If we apply this result to the auxiliary regression $x_2 = X_1P + M_1x_2$ we
may substitute $SSR = x_2'M_1x_2$ and $SST = \sum (x_{2i} - \bar{x}_2)^2 = (n-1)s_{x_2}^2$. Denoting
the $R^2$ of this auxiliary regression by $R_a^2$ we obtain the following result.


The effect of multicollinearity

In the multiple regression model the variance of the last regression coefficient
(denoted by $b_2$) may be decomposed as

$$ \mathrm{var}(b_2) = \frac{\sigma^2}{(n-1)s_{x_2}^2(1 - R_a^2)}. $$

If we compare this with (3.45), we see three familiar factors and a new one,
$(1 - R_a^2)$. So $\mathrm{var}(b_2)$ increases with $R_a^2$ and it even explodes if $R_a^2 \uparrow 1$. This is
called the multicollinearity problem. If $x_2$ is closely related to the remaining
regressors $X_1$, it is hard to estimate its isolated effect accurately. Indeed, if $R_a^2$
is large, then $x_2$ is strongly correlated with the set of variables in $X_1$, so that
the ‘direct’ effect of $x_2$ on y (that is, $\beta_2$) is accompanied by strong side effects
via $X_1$ on y.

Rewriting the above result for an arbitrary column of X (except the
intercept), we get

$$ \mathrm{var}(b_j) = \frac{\sigma^2}{(n-1)s_{x_j}^2(1 - R_j^2)}, \qquad (j = 2, \ldots, k), \tag{3.47} $$

where $R_j^2$ denotes the $R^2$ of the auxiliary regression of the jth regressor
variable on the remaining (k − 1) regressors (including the constant term)
and $s_{x_j}^2$ is the sample variance of $x_j$. So, accurate estimates of ‘direct’ or
‘partial’ effects are obtained for large sample sizes, large variation in the
relevant explanatory variable, small error variance, and small collinearity
with the other explanatory variables. The factor $1/(1 - R_j^2)$ is called
the variance inflation factor; that is, the factor by which the variance
increases because of collinearity of the jth regressor with the other (k − 1)
regressors.
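
The role of the four factors in (3.47) can be seen by evaluating the formula for increasing values of $R_j^2$. This is a sketch under assumed inputs: the function name `coefficient_variance` and the values of $\sigma^2$, n, and $s_{x_j}^2$ are illustrative and are not taken from the text.

```python
def coefficient_variance(sigma2, n, s2_xj, r2_j):
    """var(b_j) according to (3.47)."""
    if not 0.0 <= r2_j < 1.0:
        raise ValueError("R_j^2 must lie in [0, 1); at 1 the regressors are perfectly collinear")
    vif = 1.0 / (1.0 - r2_j)              # variance inflation factor
    return sigma2 / ((n - 1) * s2_xj) * vif

# Assumed illustrative values (not taken from the text).
sigma2, n, s2_xj = 1.0, 101, 4.0
variances = {r2_j: coefficient_variance(sigma2, n, s2_xj, r2_j)
             for r2_j in (0.0, 0.5, 0.9, 0.99)}

for r2_j, v in variances.items():
    print(f"R_j^2 = {r2_j:4.2f}  var(b_j) = {v:.4f}")
```

With these inputs the variance is 0.0025 without collinearity and is multiplied by 2, 10, and 100 for $R_j^2$ equal to 0.5, 0.9, and 0.99 respectively.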

Interpretation  of  results 

In many applications we hope to find significant estimates of the partial
effects of the explanatory variables. If some of the t-values of the regression
coefficients are small, this may possibly be caused by high correlations
among the explanatory variables, measured by the coefficients of determination
$R_j^2$. One method to improve the significance is to get more data, if this is
possible. However, if the purpose of the model is to estimate the total
effects of some of the variables (as opposed to partial effects), then another
solution is to drop some of the other explanatory variables. In some applications
the individual parameters may not be of so much interest (for instance,
in prediction). Then multicollinearity is not a very relevant issue, but it
may be of interest to compare the forecast quality of the full model with that
of restricted versions where some of the explanatory variables are omitted.
Methods to choose the number of explanatory variables in prediction will be
discussed later (see Section 5.2.1).

Exercises: S: 3.12; E: 3.14d.


3.3.4  Illustration:  Bank  Wages 

To illustrate the factors that affect the standard errors of least squares
estimates we consider once again the bank wage data. Panel 1 of Exhibit
3.13 shows once more the regression of salary on five explanatory variables
(including the constant term; see also Panel 2 of Exhibit 3.12). The standard errors of the estimated
parameters are relatively small, but it is still of interest to decompose these
errors as in (3.47) to see if this is only due to the fact that the number
of observations n = 474 is quite large. The values $R_j^2$ of the auxiliary regressions
are equal to $R_2^2 = 0.47$ (shown in Panel 2), $R_3^2 = 0.59$, $R_4^2 = 0.33$, and
$R_5^2 = 0.07$. Recall from Section 3.1.6 that $R^2$ is the square of a correlation
coefficient, so that these outcomes cannot directly be compared to the
(bivariate) correlations that are also reported in Panel 3 of Exhibit 3.13.
Therefore Panel 3 also contains the values of $R_j = \sqrt{R_j^2}$ and of the square
root of the variance inflation factors $1/\sqrt{1 - R_j^2}$ that affect the standard

Panel 1: Dependent Variable: LOGSAL
Method: Least Squares
Sample: 1 474
Included observations: 474

Variable        Coefficient    Std. Error    t-Statistic    Prob.
C               2.079647       0.314798      6.606288       0.0000
EDUC            0.023268       0.003870      6.013129       0.0000
LOGSALBEGIN     0.821799       0.036031      22.80783       0.0000
GENDER          0.048156       0.019910      2.418627       0.0160
MINORITY        -0.042369      0.020342      -2.082842      0.0378

R-squared             0.804117    Mean dependent var    10.35679
Adjusted R-squared    0.802446    S.D. dependent var    0.397334
S.E. of regression    0.176603    Sum squared resid     14.62750

Panel 2: Dependent Variable: EDUC
Method: Least Squares

Variable        Coefficient    Std. Error    t-Statistic    Prob.
C               -41.59997      3.224768      -12.90014      0.0000
LOGSALBEGIN     5.707538       0.339359      16.81859       0.0000
GENDER          -0.149278      0.237237      -0.629237      0.5295
MINORITY        -0.071606      0.242457      -0.295337      0.7679

R-squared       0.470869

Panel 3

                    EDUC        LOGSALBEGIN    GENDER      MINORITY
R_j^2               0.470869    0.592042       0.330815    0.071537
R_j                 0.6862      0.7694         0.5752      0.2675
1/sqrt(1 - R_j^2)   1.3747      1.5656         1.2224      1.0378

Pairwise sample correlations

                    EDUC        LOGSALBEGIN    GENDER      MINORITY
EDUC                1.000000
LOGSALBEGIN         0.685719    1.000000
GENDER              0.355986    0.548020       1.000000
MINORITY            -0.132889   -0.172836      0.075668    1.000000

Exhibit  3.13  Bank  Wages  (Section  3.3.4) 

Panel  1  shows  the  regression  of  salary  (in  logarithms)  on  a  constant,  education,  begin  salary 
(in  logarithms),  gender,  and  minority.  Panel  2  shows  the  regression  of  one  of  the  explanatory 
variables  (EDUC)  on  the  other  ones,  with  corresponding  coefficient  of  determination.  Similar 
regressions are performed (but not shown) and the corresponding $R^2$ are reported in Panel 3,
together  with  the  values  of  R  and  of  the  square  root  of  the  variance  inflation  factors.  For 
comparison,  Panel  3  also  contains  the  pairwise  sample  correlations  between  the  explanatory 
variables. 


errors of $b_j$ in (3.47), for j = 2, 3, 4, 5. The largest multiple correlation is
$R_3 = 0.77$ with corresponding square root of the variance inflation factor
$1/\sqrt{1 - R_3^2} = 1.57$. This shows that some collinearity exists. However, as
the variance inflation factors are not so large, multicollinearity does not seem
to be a very serious problem in this example.
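
The entries of Panel 3 of Exhibit 3.13 follow directly from the four auxiliary $R_j^2$ values. The sketch below recomputes them; the dictionary and variable names are our own.

```python
import math

# R_j^2 of the auxiliary regressions, as reported in Panel 3 of Exhibit 3.13.
r2_aux = {
    "EDUC": 0.470869,
    "LOGSALBEGIN": 0.592042,
    "GENDER": 0.330815,
    "MINORITY": 0.071537,
}

# For each regressor: (multiple correlation R_j, square root of the variance inflation factor).
results = {name: (math.sqrt(r2), math.sqrt(1.0 / (1.0 - r2)))
           for name, r2 in r2_aux.items()}

for name, (r_j, sqrt_vif) in results.items():
    print(f"{name:12s} R_j = {r_j:.4f}  1/sqrt(1 - R_j^2) = {sqrt_vif:.4f}")
```

The standard error of the coefficient of LOGSALBEGIN is thus about 1.57 times as large as it would be if begin salary were uncorrelated with the other regressors, whereas for MINORITY the inflation is less than 4 per cent.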


3.4  The  F-test 


3.4.1  The  F-test  in  different  forms 

Uses  Section  1.2.3,  1.4.1;  Appendix  A.2-A.4. 

Testing  the  joint  significance  of  more  than  one  coefficient 

In Section 3.2 we considered the choice between the unrestricted model

$$ y = X_1\beta_1 + X_2\beta_2 + \varepsilon $$

with estimates $y = X_1b_1 + X_2b_2 + e$, and the restricted model with $\beta_2 = 0$
and estimates $y = X_1b_R + e_R$. We may prefer to work with the simpler
restricted model if $b_2$ is small. The question is when $b_2$ is small enough to
do so, so that a measure is needed for the distance between $b_2$ and 0. For this
purpose the F-test is commonly used to test the null hypothesis that $\beta_2 = 0$.
One computes the F-statistic to be defined below and uses the restricted
model if F does not exceed a certain critical value.


Derivation  of  the  F-test 

To derive the F-test for $H_0: \beta_2 = 0$ against $H_1: \beta_2 \neq 0$, we use the result in (3.38),
which states that, if $\beta_2 = 0$,

$$ b_2 = (X_2'M_1X_2)^{-1}X_2'M_1\varepsilon. $$

Under Assumptions 1-7 we conclude that $E[b_2] = 0$ and $b_2 \sim N(0, V)$, where
$V = \mathrm{var}(b_2) = \sigma^2(X_2'M_1X_2)^{-1}$. Let $V^{-1/2}$ be a symmetric matrix with the property
that $V^{-1/2}VV^{-1/2} = I$, the g × g identity matrix. Such a matrix $V^{-1/2}$ is called a
square root of the matrix $V^{-1}$, and it exists because V is a positive definite matrix.
As $b_2 \sim N(0, V)$, it follows that $V^{-1/2}b_2 \sim N(0, I)$; that is, the g components of
$V^{-1/2}b_2$ are independently distributed with standard normal distribution. By
definition it follows that the sum of the squares of these components, $b_2'V^{-1}b_2$,
has the $\chi^2(g)$ distribution. As $V^{-1} = \sigma^{-2}X_2'M_1X_2$ this means that

$$ b_2'X_2'M_1X_2b_2/\sigma^2 \sim \chi^2(g), \tag{3.48} $$
if the null hypothesis that $\beta_2 = 0$ is true. However, this still involves the unknown
parameter $\sigma^2$ and hence it cannot be used in practice. But if we divide it by the
ratio $e'e/\sigma^2$ (which follows a $\chi^2(n - k)$ distribution (see Section 3.3.1)), and if we
divide both the numerator and the denominator by their degrees of freedom, the
two factors with the unknown $\sigma^2$ cancel and we obtain

$$ F = \frac{b_2'X_2'M_1X_2b_2/g}{e'e/(n-k)}. \tag{3.49} $$

This follows an F(g, n − k) distribution, as it was shown in Section 3.3.1 that
$s^2 = e'e/(n - k)$ and the least squares estimator b (and hence also $b_2$) are
independent (for an alternative proof see Exercise 3.7). Using (3.34) we see that
$b_2'X_2'M_1X_2b_2 = e_R'e_R - e'e$, so that F may be computed as follows.

Basic  form  of  the  F-test 


$$ F = \frac{(e_R'e_R - e'e)/g}{e'e/(n-k)} \sim F(g, n-k). \tag{3.50} $$


So the smaller model with $\beta_2 = 0$ is rejected if the increase in the sum of
squared residuals $e_R'e_R - e'e$ is too large. The null hypothesis that $\beta_2 = 0$ is
rejected for large values of F; that is, this is a one-sided test (see Exhibit
3.14). A geometric impression of the equality of the two forms (3.49) and
(3.50) of the F-test is given in Exhibit 3.15. This equality can be derived from
the theorem of Pythagoras, as is explained in the text below the exhibit.




Exhibit  3.14  P-value 

F-test  on  parameter  restrictions,  where  g  is  the  number  of  restrictions  under  the  null  hypoth¬ 
esis,  n  is  the  total  number  of  observations,  and  k  is  the  total  number  of  regression  parameters  in 
the  unrestricted  model.  The  P-value  is  equal  to  the  area  of  the  shaded  region  in  the  right  tail, 
and  the  arrow  on  the  horizontal  axis  indicates  the  calculated  F- value. 

Exhibit 3.15

Three-dimensional geometric impression of the F-test for the null hypothesis that the
variables $X_2$ are not significant. The projection of y on the unrestricted model (which
contains both $X_1$ and $X_2$) is given by $X_1b_1 + X_2b_2$ with residual vector e. The projection
of y on the restricted model (which contains only $X_1$) is given by $X_1b_R$ with residual
vector $e_R$. The vectors $e_R$ and e are both orthogonal to the variables $X_1$, and hence the
same holds true for the difference $e_R - e$. This difference is the residual that remains
after projection of $X_1b_1 + X_2b_2$ on the space of the variables $X_1$; that is, $e_R - e =
M_1(X_1b_1 + X_2b_2) = M_1X_2b_2$. As the vector e is orthogonal to $X_1$ and $X_2$, it is also
orthogonal to $M_1X_2b_2$. The theorem of Pythagoras implies that $e_R'e_R = e'e +
(M_1X_2b_2)'M_1X_2b_2 = e'e + b_2'X_2'M_1X_2b_2$. The F-test for $\beta_2 = 0$ corresponds to testing
whether the contribution $M_1X_2b_2$ of explaining y in terms of $X_2$ is significant; that is, it
tests whether the length of $e_R$ is significantly larger than the length of e, or, equivalently,
whether $(e_R'e_R - e'e)$ differs significantly from 0.


The  F-test  with  R2 

In the literature the F-test appears in various equivalent forms, and we now
present some alternative formulations. Let $R^2$ and $R_R^2$ denote the coefficients
of determination for the unrestricted model and the restricted model respectively.
Then $e'e = SST(1 - R^2)$ and $e_R'e_R = SST(1 - R_R^2)$, where the total sum
of squares is in both cases equal to $SST = \sum (y_i - \bar{y})^2$. Substituting this in
(3.50) gives

$$ F = \frac{n-k}{g} \cdot \frac{R^2 - R_R^2}{1 - R^2}. \tag{3.51} $$

So the restriction $\beta_2 = 0$ is not rejected if the $R^2$ does not decrease too much
when this restriction is imposed. This method to compare the $R^2$ of two
models is preferred over the use of the adjusted $R^2$ of Section 3.1.6. This is
because the F-test can be used to compute the P-value for the null hypothesis
that $\beta_2 = 0$, which provides a more explicit basis to decide whether the
decrease in fit is significant or not.

The derivation of (3.51) from (3.50) makes clear that the $R^2$ or the
adjusted $R^2$ can only be used to compare two models that have the same
dependent variable. For instance, it makes no sense to compare the $R^2$ of a
model where y is the measured variable with another model where y is the
logarithm of the measured variable. This is because the total sums of squares
(SST) of the two models differ; that is, explaining the variation of y around its
mean is something different from explaining the variation of log(y) around
its mean.
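
The equivalence of the forms (3.50) and (3.51) can be verified numerically. The sketch below uses assumed illustrative numbers (they are not taken from the bank wage data), and the function names are our own; the only requirement is that both models have the same total sum of squares.

```python
def f_from_ssr(ssr_restricted, ssr_unrestricted, g, n, k):
    """F-statistic in the basic form (3.50)."""
    return ((ssr_restricted - ssr_unrestricted) / g) / (ssr_unrestricted / (n - k))

def f_from_r2(r2_restricted, r2_unrestricted, g, n, k):
    """F-statistic in the R^2 form (3.51)."""
    return (n - k) / g * (r2_unrestricted - r2_restricted) / (1.0 - r2_unrestricted)

# Assumed illustrative numbers; both models share the same dependent variable,
# and hence the same total sum of squares SST.
sst, n, k, g = 100.0, 50, 4, 2
r2_u, r2_r = 0.60, 0.50
ssr_u, ssr_r = sst * (1 - r2_u), sst * (1 - r2_r)

f_ssr = f_from_ssr(ssr_r, ssr_u, g, n, k)
f_r2 = f_from_r2(r2_r, r2_u, g, n, k)
print(f_ssr, f_r2)
```

Both forms give F = 5.75 here (up to floating-point rounding), because SST cancels from the numerator and the denominator.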


F-  and  t-tests 

The above F-statistics can be computed for every partition of the matrix X in
two parts $X_1$ and $X_2$. For instance, in the particular case that $X_2$ consists of
only one column (so that g = 1) we have $F = t^2$; that is, the F-statistic equals the
square of the t-statistic and in this case the F-test and the two-sided t-test
always lead to the same conclusion (see Exercise 3.7).

Test  on  the  overall  significance  of  the  regression 

Several statistical packages present for every regression the F-statistic and its
associated P-value for the so-called significance of the regression. This
corresponds to a partitioning of X in $X_1$ and $X_2$ where $X_1$ only contains the
constant term (that is, $X_1$ is a single column consisting of unit elements) and
$X_2$ contains all remaining columns (so that g = k − 1). If we denote the
components of the (k − 1) × 1 vector $\beta_2$ by the scalar parameters
$\beta_2, \ldots, \beta_k$, then the null hypothesis is that $\beta_2 = \beta_3 = \cdots = \beta_k = 0$, which
means that none of the explanatory variables (apart from the constant term)
has an effect on y. So this tests whether the model makes any sense at all. In this
case, $e_R = y - \iota\bar{y}$ (with $\iota$ the n × 1 vector of unit elements) and $e_R'e_R = SST$,
so that $R_R^2 = 0$. For this special case the F-statistic can therefore be written as

$$ F = \frac{n-k}{k-1} \cdot \frac{R^2}{1 - R^2}. $$

So there is a straightforward link between the F-test for the joint significance
of all variables (except the intercept) and the coefficient of determination
$R^2$.
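
As a small numerical check of this link, the overall F-statistic can be computed from the $R^2$ of the regression in Panel 1 of Exhibit 3.13 (n = 474, k = 5). The function name below is our own.

```python
def overall_f(r2, n, k):
    """F-statistic for the joint significance of all regressors except the constant."""
    return (n - k) / (k - 1) * r2 / (1.0 - r2)

# R-squared of the regression of LOGSAL on a constant, EDUC, LOGSALBEGIN,
# GENDER and MINORITY (Panel 1 of Exhibit 3.13).
f_overall = overall_f(0.804117, n=474, k=5)
print(f"{f_overall:.2f}")
```

This gives a value of about 481.32, far above any conventional critical value of the F(4, 469) distribution.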

Test of general linear restrictions

Until now we have tested whether certain parameters are zero and we have
decomposed the regression matrix $X = (X_1 \; X_2)$ accordingly. An arbitrary
set of linear restrictions on the parameters can be expressed in the form
$R\beta = r$, where R is a given g × k matrix with rank g and r is a given g × 1
vector. We consider the testing problem

$$ y = X\beta + \varepsilon, \qquad H_0: R\beta = r, \tag{3.52} $$

which imposes g independent linear restrictions on $\beta$ under the null hypothesis.
Examples are given in Section 3.4.2.


Derivation  of  the  F-test 

We can test these restrictions, somewhat in the spirit of the t-test, by estimating the
unrestricted model and checking whether Rb is sufficiently close to r. Under
Assumptions 1-7, it follows that $b \sim N(\beta, \sigma^2(X'X)^{-1})$ (see (3.42)). Therefore
$Rb - r \sim N(R\beta - r, \sigma^2R(X'X)^{-1}R')$ and we reject the null hypothesis if
$Rb - r$ differs significantly from zero. If the null hypothesis is true, then
$Rb - r \sim N(0, \sigma^2R(X'X)^{-1}R')$ and

$$ (Rb - r)'[\sigma^2R(X'X)^{-1}R']^{-1}(Rb - r) \sim \chi^2(g). \tag{3.53} $$

The unknown $\sigma^2$ drops out again if we divide by $e'e/\sigma^2$, which has the $\chi^2(n - k)$
distribution and which is independent of b and hence also of the expression (3.53).
By the definition of the F-distribution, this means that

$$ F = \frac{(Rb - r)'[R(X'X)^{-1}R']^{-1}(Rb - r)/g}{e'e/(n-k)} \tag{3.54} $$

follows the F(g, n − k) distribution if the null hypothesis is true. Expression (3.54)
is not so convenient from a computational point of view. It is left as an exercise
(see Exercise 3.8) that this F-test can again be written in terms of the sum of
squared residuals (SSR) as in (3.50), where $e'e$ is the unrestricted SSR and $e_R'e_R$ is
the SSR under the null hypothesis.


Summary  of  computations 

A set of linear restrictions on the model parameters can be tested as follows.
Let n be the number of observations, k the number of parameters of the
unrestricted model, and g the number of parameter restrictions under the null
hypothesis (so that there are only (k − g) free parameters in the restricted
model).

Testing a set of linear restrictions

•  Step 1: Estimate the unrestricted model. Estimate the unrestricted model
and compute the corresponding sum of squared residuals $e'e$.

•  Step 2: Estimate the restricted model. Estimate the restricted model under
the null hypothesis and compute the corresponding sum of squared residuals
$e_R'e_R$.

•  Step 3: Perform the F-test. Compute the F-test by means of (3.50), and
reject the null hypothesis for large values of F. The P-values can be obtained
from the fact that the F-test has the F(g, n − k) distribution if the null
hypothesis is true (provided that Assumptions 1-7 are satisfied).
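
Step 3 can be sketched in code once the two sums of squared residuals from Steps 1 and 2 are available. The sketch below is one possible implementation, not the procedure of any particular package: the function names are our own, the numerical inputs are assumed for illustration, and the P-value is computed from the regularized incomplete beta function (evaluated by a standard continued fraction), using that $P(F(g, m) > f) = I_x(m/2, g/2)$ with $x = m/(m + gf)$.

```python
import math

def _beta_cf(a, b, x, max_iter=300, eps=1e-14):
    """Continued fraction for the regularized incomplete beta function (Lentz's method)."""
    tiny = 1e-300
    c = 1.0
    d = 1.0 - (a + b) * x / (a + 1.0)
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        # Even and odd steps of the continued fraction.
        for num in (m * (b - m) * x / ((a + 2 * m - 1) * (a + 2 * m)),
                    -(a + m) * (a + b + m) * x / ((a + 2 * m) * (a + 2 * m + 1))):
            d = 1.0 + num * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + num / c
            c = c if abs(c) > tiny else tiny
            h *= d * c
        if abs(d * c - 1.0) < eps:
            break
    return h

def reg_inc_beta(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_front = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                 + a * math.log(x) + b * math.log1p(-x))
    front = math.exp(log_front)
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_cf(a, b, x) / a
    return 1.0 - front * _beta_cf(b, a, 1.0 - x) / b

def f_p_value(f, g, df):
    """P(F(g, df) > f), the right-tail probability of the F-distribution."""
    return reg_inc_beta(df / 2.0, g / 2.0, df / (df + g * f))

def f_test(ssr_unrestricted, ssr_restricted, g, n, k):
    """F-statistic (3.50) and its P-value under the F(g, n - k) distribution."""
    f = ((ssr_restricted - ssr_unrestricted) / g) / (ssr_unrestricted / (n - k))
    return f, f_p_value(f, g, n - k)

# Assumed illustrative inputs: SSR of the unrestricted and the restricted model.
f_stat, p_val = f_test(ssr_unrestricted=40.0, ssr_restricted=50.0, g=2, n=50, k=4)
print(f"F = {f_stat:.3f}, P-value = {p_val:.4f}")
```

The null hypothesis is rejected at significance level α if the P-value is smaller than α. In practice one would normally rely on the F-distribution routines of a statistical package rather than on a hand-written approximation.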


Exercises:  T:  3.6c,  d,  3.7e,  f,  3.8,  3.10;  E:  3.13,  3.15,  3.19a-e. 


3.4.2  Illustration:  Bank  Wages 

As an illustration, we consider again the data discussed in previous examples
on salary (y, in logarithms of yearly wage), education ($x_2$, in years), begin
salary ($x_3$, in logarithms of yearly wage), gender ($x_4$, taking the value 0 for
females and 1 for males), and minority ($x_5$, taking the value 0 for
non-minorities and 1 for minorities). We will discuss (i) the results of various
models for three data sets, (ii) the significance of the variable minority, (iii)
the joint significance of the regression, (iv) the joint significance of gender
and minority, and (v) the test whether gender and minority have the
same effect.

(i)  Results  of  various  models  for  three  data  sets 

Exhibit 3.16 summarizes results (the sum of squared residuals and the coefficient
of determination) of regressions in the unrestricted model

$$ y = \beta_1 + \beta_2x_2 + \beta_3x_3 + \beta_4x_4 + \beta_5x_5 + \varepsilon $$

(see Panel 1) and in several restricted versions corresponding to different
restrictions on the parameters $\beta_i$, i = 1, ..., 5 (see Panel 2). Most of the
results of the unrestricted regression in Panel 1 of Exhibit 3.16 were already
reported in Panel 1 of Exhibit 3.13 (p. 160).

In Panel 2 of Exhibit 3.16 the models are estimated for different data sets.
One version uses the data of all 474 employees, a second one of the employees
with custodial jobs (job category 2), and a third one of the employees with
management jobs (job category 3). Some of the regressions cannot be performed
for the second version. The reason is that all employees with a
custodial job are male, so that $x_4 = 1$ for all employees in job category 2.

Panel 1: Dependent Variable: LOGSAL
Method: Least Squares
Sample: 1 474
Included observations: 474

Variable        Coefficient    Std. Error    t-Statistic    Prob.
C               2.079647       0.314798      6.606288       0.0000
EDUC            0.023268       0.003870      6.013129       0.0000
LOGSALBEGIN     0.821799       0.036031      22.80783       0.0000
GENDER          0.048156       0.019910      2.418627       0.0160
MINORITY        -0.042369      0.020342      -2.082842      0.0378

R-squared             0.804117    Mean dependent var    10.35679
Adjusted R-squared    0.802446    S.D. dependent var    0.397334
S.E. of regression    0.176603    F-statistic           481.3211
Sum squared resid     14.62750    Prob(F-statistic)     0.000000

Panel 2

                              ALL (n = 474)       JOBCAT 2 (n = 27)    JOBCAT 3 (n = 84)
X-variables                   SSR       R2        SSR       R2         SSR       R2
1                             74.6746   0.0000    0.1274    0.0000     5.9900    0.0000
1 2                           38.4241   0.4854    0.1249    0.0197     4.8354    0.1928
1 2 3                         14.8917   0.8006    0.1248    0.0204     3.1507    0.4740
1 2 3 4                       14.7628   0.8023                         3.1263    0.4781
1 2 3 5                       14.8100   0.8017    0.1224    0.0391     3.0875    0.4846
1 2 3 4 5 (β4 + β5 = 0)       14.6291   0.8041                         3.1503    0.4741
1 2 3 4 5 (unrestricted)      14.6275   0.8041                         3.0659    0.4882

Exhibit  3.16  Bank  Wages  (Section  3.4.2) 


Summary  of  outcomes  of  regressions  where  the  dependent  variable  (logarithm  of  salary)  is 
explained  in  terms  of  different  sets  of  explanatory  variables.  Panel  1  shows  the  unrestricted 
regression  in  terms  of  five  explanatory  variables  (including  a  constant  term).  In  Panel  2,  the 
explanatory  variables  (X)  are  denoted  by  their  index  1  (the  constant  term),  2  (education),  3 
(logarithm of begin salary), 4 (gender), and 5 (minority). The significance of explanatory
variables can be tested by F-tests using the SSR (sum of squared residuals) or the $R^2$ (coefficient
of determination) of the regressions. The column ‘X-variables’ indicates which variables
are included in the model (in the sixth row all variables are included and the parameter
restriction is that $\beta_4 + \beta_5 = 0$). The models are estimated for three data sets, for all 474
employees,  for  the  twenty-seven  employees  in  job  category  2  (custodial  jobs),  and  for  the 
eighty-four  employees  in  job  category  3  (management  jobs). 

Therefore the variable $x_4$ should not be included in this second version of the
model, as $x_4 = x_1 = 1$ and this would violate Assumption 1. With the results
in Exhibit 3.16, we will perform four tests, all for the data set of all 474
employees. We refer to Exercise 3.13 for the analysis of similar questions for
the sub-samples of employees with management or custodial jobs.

(ii)  Significance  of  minority 

Here the unrestricted model contains a constant term and the
variables $x_2$, $x_3$, $x_4$, and $x_5$, and we test $H_0: \beta_5 = 0$ against $H_1: \beta_5 \neq 0$.
This corresponds to (3.52) with k = 5 and g = 1 and with
R = (0, 0, 0, 0, 1) and r = 0. This restriction can be tested by the t-value of
$b_5$ in Panel 1 of Exhibit 3.16. It is equal to −2.083 with P-value 0.038, so
that the hypothesis is rejected at the 5 per cent level of significance.

As an alternative, we can also compare the residual sum of squares
$e'e = 14.6275$ in the unrestricted model (see last row in Panel 2 of Exhibit
3.16) with the restricted sum of squares $e_R'e_R = 14.7628$ (see the row with
$x_1$, $x_2$, $x_3$, and $x_4$ included in Panel 2 of Exhibit 3.16) and compute the F-test

$$ F = \frac{(14.7628 - 14.6275)/1}{14.6275/(474 - 5)} = 4.338 $$

with corresponding P-value of 0.038. The 5 per cent critical value of the
F(1, 469) distribution is 3.84, so that the null hypothesis is rejected at 5 per
cent significance level. Note that $\sqrt{4.338} = 2.083$ is equal (in absolute value)
to the t-value of $b_5$, that $\sqrt{3.84} = 1.96$ is the two-sided 5 per cent critical
value of the t(469) distribution, and that the P-values of the t-test and the
F-test are equal. If we substitute the values $R^2 = 0.8041$ and $R_R^2 = 0.8023$
into (3.51), then the same value for F is obtained, apart from rounding errors.
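
The calculations for the significance of minority can be replicated as follows. This is a sketch with our own variable names; the inputs are the rounded values printed in Exhibit 3.16, so that the three routes agree only up to rounding.

```python
import math

n, k, g = 474, 5, 1
ssr_unrestricted = 14.6275   # all five regressors
ssr_restricted = 14.7628     # minority (x5) omitted

# Basic form (3.50) of the F-test.
f_ssr = ((ssr_restricted - ssr_unrestricted) / g) / (ssr_unrestricted / (n - k))

# t-value of b5 in Panel 1 of Exhibit 3.16.
t_minority = -2.082842

# (3.51) with the R^2 values rounded to four decimals, as printed in Panel 2.
r2_u, r2_r = 0.8041, 0.8023
f_r2 = (n - k) / g * (r2_u - r2_r) / (1.0 - r2_u)

print(f"F from SSR   : {f_ssr:.3f}")
print(f"t squared    : {t_minority ** 2:.3f}")
print(f"sqrt(F)      : {math.sqrt(f_ssr):.3f}")
print(f"F from R^2   : {f_r2:.3f}")
```

The F-statistic based on the SSR and the squared t-value both equal 4.338 to three decimals. The $R^2$ form gives about 4.31, because with only four decimals the difference $R^2 - R_R^2 = 0.0018$ is known to just two significant digits.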


(iii)  Significance  of  the  regression 

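Now we test the joint significance of all explanatory variables by testing the null hypothesis that $\beta_2 = \beta_3 = \beta_4 = \beta_5 = 0$. In this case there are $g = 4$ independent restrictions and in terms of (3.52) we have

$$R = \begin{pmatrix} 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix}, \qquad r = \begin{pmatrix} 0 \\ 0 \\ 0 \\ 0 \end{pmatrix}.$$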

Using the values of the sum of squared residuals in Panel 2 of Exhibit 3.16, the F-statistic becomes

$$F = \frac{(74.6746 - 14.6275)/4}{14.6275/(474 - 5)} = 481.321.$$

The 5 per cent critical value of $F(4, 469)$ is 2.39 and so this hypothesis is strongly rejected. Note that the value of this F-test has already been reported in the regression table in Panel 1 in Exhibit 3.16, with a P-value that is rounded to zero.


(iv)  Joint  significance  of  gender  and  minority 

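Next we test the null hypothesis that $\beta_4 = \beta_5 = 0$. This corresponds to (3.52) with $k = 5$ and $g = 2$ and with

$$R = \begin{pmatrix} 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix}, \qquad r = \begin{pmatrix} 0 \\ 0 \end{pmatrix}.$$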
To perform this test for the joint significance of the variables $x_4$ and $x_5$, we use row 3 (for the restricted model) and row 7 (for the unrestricted model) in Exhibit 3.16, Panel 2, and find (using the $R^2$ this time)

$$F = \frac{(0.8041 - 0.8006)/2}{(1 - 0.8041)/(474 - 5)} \approx 4.19.$$

The P-value with respect to the $F(2, 469)$ distribution is equal to $P = 0.016$. So, at 5 per cent significance level we reject the null hypothesis.
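The $R^2$ form of the F-test used in this computation is easy to evaluate directly. The Python sketch below assumes the form $F = \frac{(R^2 - R_R^2)/g}{(1 - R^2)/(n - k)}$; the helper name `f_from_r2` is hypothetical. As a second check it sets $R_R^2 = 0$ (a restricted model with only a constant term) and uses the more precise $R^2 = 0.804117$ reported in Panel 1 of Exhibit 3.18, which has the same sum of squared residuals as the unrestricted model here. This should reproduce the F-statistic of part (iii) up to rounding.

```python
def f_from_r2(r2_unrestricted, r2_restricted, g, n, k):
    """F-statistic in terms of R-squared values:
    F = ((R2 - R2_R) / g) / ((1 - R2) / (n - k))."""
    return ((r2_unrestricted - r2_restricted) / g) / ((1.0 - r2_unrestricted) / (n - k))

# (iv) joint significance of gender and minority, values quoted in the text.
f_gender_minority = f_from_r2(0.8041, 0.8006, g=2, n=474, k=5)

# (iii) significance of the regression: restricted model has R2_R = 0.
f_regression = f_from_r2(0.804117, 0.0, g=4, n=474, k=5)

print(round(f_gender_minority, 2), round(f_regression, 1))
```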

(v)  Test  whether  gender  and  minority  have  the  same  effect 

In the unrestricted model the variable gender ($x_4$) has a positive coefficient (0.048). As $x_4 = 0$ for females and $x_4 = 1$ for males, this means that, on average, males have higher salaries than females (for the same education, begin salary, and minority classification). Further, the variable minority has a negative coefficient ($-0.042$). As $x_5 = 1$ for minorities and $x_5 = 0$ for non-minorities, this means that, on average, minorities have lower salaries than non-minorities (for the same education, begin salary, and gender). As the two estimated effects are nearly of equal magnitude, we will test whether the advantage of males is as large as the advantage of non-minorities. This corresponds to the null hypothesis that $\beta_4 = -\beta_5$, or, equivalently, $\beta_4 + \beta_5 = 0$. In terms of (3.52), we have $k = 5$, $g = 1$, $R = (0, 0, 0, 1, 1)$, and $r = 0$. Using the last two rows in Exhibit 3.16, Panel 2, we get (in terms of SSR)

$$F = \frac{(14.6291 - 14.6275)/1}{14.6275/(474 - 5)} = 0.05,$$

with a P-value of $P = 0.821$. So this hypothesis is not rejected; that is, the two factors of discrimination (gender and minority) seem to be of equal magnitude.


3.4.3  Chow  forecast  test 

Uses  Appendix  A.2-A.4. 


Evaluation  of  predictive  performance:  Sample  split 

One  of  the  possible  practical  uses  of  a  multiple  regression  model  is  to  produce 
forecasts  of  the  dependent  variable  for  given  values  of  the  explanatory 
variables.  It  is,  therefore,  of  interest  to  evaluate  an  estimated  regression 
model  by  studying  its  predictive  performance  out  of  sample.  For  this  purpose 
Exhibit 3.17 Prediction

[Diagram: the full sample of n + g observations, divided into an estimation sample (n observations) followed by a prediction sample (g observations).]

The full sample is split into two non-overlapping parts, the estimation sample with observations that are used to estimate the model, and the prediction sample. The estimated model is used to forecast the values in the prediction sample, which can be compared with the actually observed values in the prediction sample.

the full sample is split into two parts, an estimation sample with $n$ observations used to estimate the parameters, and a prediction sample with $g$ additional observations used for the evaluation of the forecast quality of the estimated model. This is illustrated in Exhibit 3.17.

Notation 

The data in the estimation sample are denoted by $y_1$ and $X_1$, where $y_1$ is an $n \times 1$ vector and $X_1$ an $n \times k$ matrix. The data in the prediction sample are denoted by $y_2$ and $X_2$, where $y_2$ is a $g \times 1$ vector and $X_2$ a $g \times k$ matrix. Note that this notation of $X_1$ and $X_2$ differs from the one used until now. That is, now the rows of the matrix $X$ are partitioned instead of the columns. We can write

$$y = \begin{pmatrix} y_1 \\ y_2 \end{pmatrix}, \qquad X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix},$$

where $X$ is an $(n + g) \times k$ matrix and $y$ is an $(n + g) \times 1$ vector. Since we use $y_1$ and $X_1$ for estimation, we assume that $n > k$, whereas $g$ may be any positive integer. For the DGP over the full sample we suppose that Assumptions 1-7 are satisfied, so that

$$y_1 = X_1\beta + \varepsilon_1,$$
$$y_2 = X_2\beta + \varepsilon_2,$$

with $E[\varepsilon_1\varepsilon_1'] = \sigma^2 I$, $E[\varepsilon_2\varepsilon_2'] = \sigma^2 I$, $E[\varepsilon_1\varepsilon_2'] = 0$.

Prediction  and  prediction  error 

The estimate of $\beta$ is based on the estimation sample and is given by

$$b = (X_1'X_1)^{-1}X_1'y_1.$$

This estimate is used to predict the values of $y_2$ by means of $X_2b$, with resulting prediction error

$$f = y_2 - X_2b. \qquad (3.55)$$

It is left as an exercise (see Exercise 3.7) to show that $X_2b$ is the best linear unbiased predictor of $y_2$ in the sense that it minimizes the variance of $f$. We can write

$$f = X_2\beta + \varepsilon_2 - X_2(X_1'X_1)^{-1}X_1'y_1 = \varepsilon_2 - X_2(X_1'X_1)^{-1}X_1'\varepsilon_1,$$

so that the prediction error $f$ consists of two uncorrelated components, namely the disturbance $\varepsilon_2$ and a component caused by the fact that we use $b$ rather than $\beta$ in our prediction formula $X_2b$. As a consequence, the variance of the prediction errors is larger than the variance of the disturbances,

$$\mathrm{var}(f) = \sigma^2\left(I + X_2(X_1'X_1)^{-1}X_2'\right). \qquad (3.56)$$

Superficial observation could suggest that the prediction error covariance matrix attains its minimum if $X_2 = 0$, but in a model with an intercept this is impossible (as the elements in the first column of $X_2$ all have the value 1). It can be shown that the minimum is reached if all the rows of $X_2$ are equal to the row of column averages of $X_1$ (for the regression model with $k = 2$ this follows from formula (2.39) for the variance of the prediction error in Section 2.4.1 (p. 105)).

Prediction  interval 

If $\sigma^2$ in (3.56) is replaced by the least squares estimator $s^2 = e_1'e_1/(n - k)$, where $e_1 = y_1 - X_1b$ are the residuals over the estimation sample, then one can construct forecast intervals for $y_2$. It is left as an exercise (see Exercise 3.7) to show that a $(1 - \alpha)$ prediction interval for $y_{2j}$ for given values $X_{2j}$ of the explanatory variables is given by

$$X_{2j}'b - cs\sqrt{d_{jj}} < y_{2j} < X_{2j}'b + cs\sqrt{d_{jj}},$$

where $d_{jj}$ is the $j$th diagonal element of the matrix $I + X_2(X_1'X_1)^{-1}X_2'$ in (3.56) and $c$ is such that $P[|t| > c] = \alpha$ when $t \sim t(n - k)$.
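For the simple regression model ($k = 2$, a constant and one regressor) the quantities in (3.56) and in the prediction interval can be written out in closed form. The following Python sketch does this with made-up data. The function names, the data, and the choice $c = 3.182$ (the tabulated two-sided 5 per cent critical value of the $t(3)$ distribution, matching $n - k = 3$ here) are assumptions for illustration only. The sketch also shows that the variance factor is smallest when the new regressor value equals its average over the estimation sample.

```python
import math

def fit_simple(x, y):
    """Least squares for y = a + b*x; returns (a, b, s) with s^2 = SSR/(n - 2)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    ssr = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    return a, b, math.sqrt(ssr / (n - 2))

def variance_factor(x, x_new):
    """d = 1 + x0'(X1'X1)^{-1} x0 with x0 = (1, x_new)', written out for k = 2."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return 1.0 + 1.0 / n + (x_new - xbar) ** 2 / sxx

def prediction_interval(x, y, x_new, c):
    """Interval x0'b -/+ c * s * sqrt(d) for one new observation."""
    a, b, s = fit_simple(x, y)
    half_width = c * s * math.sqrt(variance_factor(x, x_new))
    point = a + b * x_new
    return point - half_width, point + half_width

# Hypothetical estimation sample (n = 5).
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
y1 = [1.1, 1.9, 3.2, 3.9, 5.1]

d_at_mean = variance_factor(x1, 3.0)  # new x equal to the sample average
d_at_edge = variance_factor(x1, 5.0)  # new x away from the average
low, high = prediction_interval(x1, y1, x_new=6.0, c=3.182)
print(d_at_mean, d_at_edge, low, high)
```

The critical value `c` is passed in as an argument because the standard library has no t-distribution quantile function; in practice it would be taken from a table or a statistics package.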


Test  of  constant  DGP 

To obtain the predicted values of $y_2$, we assumed that the data in the two sub-samples are generated by the same DGP. This may be tested by considering whether the predictions are sufficiently accurate. For a cross section this may mean, for example, that we check whether our model estimated using data from a number of regions may be used to predict the $y$ variable in another region. For a time series model we check if our model estimated using data from a certain period can be used to predict the $y$ variable in another period. In all cases we study conditional prediction; that is, we assume that the $X_2$ matrix required in the prediction is given. In order to test the predictive accuracy, we formulate the model

$$y_1 = X_1\beta + \varepsilon_1,$$
$$y_2 = X_2\beta + \gamma + \varepsilon_2,$$

where $\gamma$ is a $g \times 1$ vector of unknown parameters. The foregoing predictions of $y_2$ are made under the assumption that $\gamma = 0$. We can test this by means of an F-test in the model


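$$\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} X_1 & 0 \\ X_2 & I \end{pmatrix} \begin{pmatrix} \beta \\ \gamma \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \end{pmatrix}, \qquad (3.57)$$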


where it is assumed, as before, that the model satisfies Assumptions 1-7 over the full sample of $n + g$ observations. To perform the F-test for $H_0: \gamma = 0$ against $H_1: \gamma \neq 0$, note that $H_0$ involves $g$ restrictions. The number of observations in the model (3.57) is $n + g$ and the number of parameters is $k + g$. So the F-test in (3.50) becomes in this case

$$F = \frac{(e_R'e_R - e'e)/g}{e'e/(n + g - (k + g))} = \frac{(e_R'e_R - e'e)/g}{e'e/(n - k)},$$

which follows the $F(g, n - k)$ distribution when $\gamma = 0$. Note that $n$ is the number of observations in the estimation sample, not in the full sample.


Derivation  of  sums  of  squares 

To compute the F-test we still have to determine the restricted sum of squared residuals $e_R'e_R$ and the unrestricted sum of squared residuals $e'e$. Under the null hypothesis that $\gamma = 0$, the model becomes

$$\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}\beta + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \end{pmatrix}.$$

So $e_R$ is obtained as the $(n + g) \times 1$ vector of residuals of the regression over the full sample of $(n + g)$ observations, and $e_R'e_R$ is the corresponding SSR. Under the alternative hypothesis that $\gamma \neq 0$, least squares in (3.57) is equivalent to minimizing

$$S(b, c) = \begin{pmatrix} y_1 - X_1b \\ y_2 - X_2b - c \end{pmatrix}' \begin{pmatrix} y_1 - X_1b \\ y_2 - X_2b - c \end{pmatrix} = (y_1 - X_1b)'(y_1 - X_1b) + (y_2 - X_2b - c)'(y_2 - X_2b - c).$$

The first term is minimized by regressing $y_1$ on $X_1$, that is, for $b = (X_1'X_1)^{-1}X_1'y_1$, and the second term attains its minimal value zero for $c = y_2 - X_2b$. So the unrestricted SSR is equal to $e'e = (y_1 - X_1b)'(y_1 - X_1b) = e_1'e_1$, that is, the SSR corresponding to a regression of the $n$ observations in the estimation sample.


Chow  forecast  test 

The test may therefore be performed by running two regressions, an ‘unrestricted’ one (the regression of $y_1$ on $X_1$ on the estimation sample with residuals $e_1$) and a ‘restricted’ one (the regression of $\begin{pmatrix} y_1 \\ y_2 \end{pmatrix}$ on $\begin{pmatrix} X_1 \\ X_2 \end{pmatrix}$ on the full sample with residuals $e_R$). This gives

$$F = \frac{(e_R'e_R - e_1'e_1)/g}{e_1'e_1/(n - k)}, \qquad (3.58)$$

which is called the Chow forecast test for predictive accuracy. If we use the expression (3.49) of the F-test instead of (3.50), then $b_2$ corresponds to the estimated parameters $\gamma$ in the unrestricted model. As stated before, these estimates are given by $c = y_2 - X_2b$, that is, $c = f$ are the prediction errors in (3.55). So the Chow test may also be written as

$$F = \frac{f'Vf/g}{e_1'e_1/(n - k)},$$

where $V$ is a $g \times g$ matrix of similar structure as in (3.49) with submatrices of explanatory variables as indicated in (3.57). This shows that the null hypothesis that $\gamma = 0$ is rejected if the prediction errors $f$ are too large.
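The two-regression recipe in (3.58) can be sketched in a few lines of Python. The example below is an illustration under stated assumptions, not the book's own computation: it uses a simple regression ($k = 2$) with made-up data, hypothetical function names, and a single prediction observation ($g = 1$). For $g = 1$ it takes $V$ to be the scalar $1/d$, where $d = 1 + X_2(X_1'X_1)^{-1}X_2'$ is the variance factor in (3.56), so that the second form of the test becomes $f^2/(d\,s^2)$. The code compares both forms on this data.

```python
def ols_simple(x, y):
    """Closed-form least squares for y = a + b*x; returns (a, b, SSR)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    ssr = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    return a, b, ssr

def chow_forecast_f(x1, y1, x2, y2, k=2):
    """Chow forecast statistic (3.58) from two regressions."""
    n, g = len(x1), len(x2)
    _, _, ssr_est = ols_simple(x1, y1)             # 'unrestricted': estimation sample
    _, _, ssr_full = ols_simple(x1 + x2, y1 + y2)  # 'restricted': full sample
    return ((ssr_full - ssr_est) / g) / (ssr_est / (n - k))

# Hypothetical data: five estimation observations, one prediction observation.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
y1 = [1.1, 1.9, 3.2, 3.9, 5.1]
x2 = [6.0]
y2 = [7.5]

f_two_regressions = chow_forecast_f(x1, y1, x2, y2)

# Alternative form for g = 1: squared prediction error scaled by d * s^2.
a, b, ssr_est = ols_simple(x1, y1)
n = len(x1)
xbar = sum(x1) / n
sxx = sum((xi - xbar) ** 2 for xi in x1)
d = 1.0 + 1.0 / n + (x2[0] - xbar) ** 2 / sxx
pred_error = y2[0] - (a + b * x2[0])
s2 = ssr_est / (n - 2)
f_prediction_error = pred_error ** 2 / (d * s2)

print(f_two_regressions, f_prediction_error)
```

The agreement of the two numbers illustrates that the increase in SSR caused by adding the prediction observation is exactly the (suitably weighted) squared prediction error.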


Comment  on  the  two  regressions  in  the  Chow  forecast  test 

Note that in the Chow forecast test (3.58) the regression in the ‘large’ (unrestricted) model corresponds to the regression over the ‘small’ sub-sample (of the first $n$ observations), whereas the regression in the ‘small’ (restricted) model corresponds to the regression over the ‘large’ sample (of all $n + g$ observations). The unrestricted model is larger in the sense that it contains more parameters ($k + g$ instead of $k$). Both models apply to the same set of $n + g$ observations, and it is precisely because the large model contains $g$ parameters for the $g$ observations in the second sub-sample that the estimation of the large model can be reduced to a regression over the first sub-sample.

Exercises:  T:  3.7g,  h,  3.11;  E:  3.17,  3.19f,  g. 


XM301BWA 


3.4.4  Illustration:  Bank  Wages 

As an illustration we return to the data on bank wages and we perform two forecast tests of the salary model with the explanatory variables $x_1$, $x_2$, $x_3$, $x_4$, and $x_5$ described in Section 3.4.2. We will discuss (i) the regression results, (ii) forecast of salaries for custodial jobs, (iii) forecast of salaries for management jobs, and (iv) a comparison of the two forecasts.

(i)  Regression  results 

We use the results in Exhibit 3.18. This exhibit contains three regressions, one over the full sample of 474 employees (Panel 1), a second one over an estimation sample of 447 employees working in administration or management (Panel 2, the twenty-seven employees with custodial jobs form the prediction sample in this case), and a third one over an estimation sample of 390 employees with administrative or custodial jobs (Panel 3, the eighty-four employees with management jobs form the prediction sample in this case).

(ii)  Forecast  of  salaries  for  custodial  jobs 

We first perform a Chow forecast test by predicting the salaries of the twenty-seven employees with custodial jobs. The corresponding F-statistic (3.58) can be computed from the results in Panels 1 and 2 in Exhibit 3.18:

$$F = \frac{(e_R'e_R - e_1'e_1)/g}{e_1'e_1/(n - k)} = \frac{(14.6275 - 13.9155)/27}{13.9155/(447 - 5)} \approx 0.838.$$

The P-value of the corresponding $F(27, 442)$ distribution is $P = 0.70$, so that the predictions are sufficiently accurate. That is, the salaries for custodial jobs can be predicted by means of the model estimated for administrative and management jobs. The scatter of twenty-seven points of the actual and predicted salaries is shown in Exhibit 3.19 (a), and a histogram of the forecast errors is given in Exhibit 3.19 (b). Although the great majority of the predicted salaries are lower than the actual salaries, indicating a
Panel 1: Dependent Variable: LOGSAL
Method: Least Squares
Sample: 1 474
Included observations: 474

Variable        Coefficient   Std. Error   t-Statistic   Prob.
C                  2.079647     0.314798      6.606288   0.0000
EDUC               0.023268     0.003870      6.013129   0.0000
LOGSALBEGIN        0.821799     0.036031      22.80783   0.0000
GENDER             0.048156     0.019910      2.418627   0.0160
MINORITY          -0.042369     0.020342     -2.082842   0.0378

R-squared             0.804117    Mean dependent var    10.35679
Adjusted R-squared    0.802446    S.D. dependent var    0.397334
S.E. of regression    0.176603    Sum squared resid     14.62750

Panel 2: Dependent Variable: LOGSAL
Method: Least Squares
Sample: 1 474 IF JOBCAT = 1 OR JOBCAT = 3
Included observations: 447

Variable        Coefficient   Std. Error   t-Statistic   Prob.
C                  2.133639     0.323277      6.600032   0.0000
EDUC               0.029102     0.004352      6.687637   0.0000
LOGSALBEGIN        0.808688     0.037313      21.67293   0.0000
GENDER             0.028500     0.020875      1.365269   0.1729
MINORITY          -0.053989     0.021518     -2.508953   0.0125

R-squared             0.813307    Mean dependent var    10.35796
Adjusted R-squared    0.811617    S.D. dependent var    0.408806
S.E. of regression    0.177434    Sum squared resid     13.91547

Panel 3: Dependent Variable: LOGSAL
Method: Least Squares
Sample (adjusted): 2 474 IF JOBCAT = 1 OR JOBCAT = 2
Included observations: 390 after adjusting endpoints

Variable        Coefficient   Std. Error   t-Statistic   Prob.
C                  3.519694     0.517151      6.805930   0.0000
EDUC               0.018640     0.003774      4.939607   0.0000
LOGSALBEGIN        0.674293     0.056446      11.94577   0.0000
GENDER             0.071522     0.020327      3.518492   0.0005
MINORITY          -0.040494     0.019292     -2.099032   0.0365

R-squared             0.552306    Mean dependent var    10.21188
Adjusted R-squared    0.547655    S.D. dependent var    0.240326
S.E. of regression    0.161635    Sum squared resid     10.05848

Exhibit  3.18  Bank  Wages  (Section  3.4.4) 

Regressions  for  two  forecast  tests.  In  Panel  1  a  model  for  salaries  is  estimated  using  the  data  of 
all  474  employees;  in  Panel  2  this  model  is  estimated  using  only  the  data  of  the  employees  with 
jobs  in  categories  1  and  3  (administration  and  management);  in  Panel  3  this  model  is  estimated 
using  only  the  data  of  the  employees  with  jobs  in  categories  1  and  2  (administration  and 
custodial  jobs). 
[Figure: panels (a) and (c) are scatter diagrams with axis label LOGSAL; panels (b) and (d) are histograms of forecast errors. Only the summary statistics printed in panel (b) are recoverable.]

(b) Series: FORECER2, Sample JOBCAT = 2

Observations    27
Mean            0.128563
Median          0.139186
Maximum         0.513183
Minimum        -0.222636
Std. Dev.       0.126952
Skewness        0.182145
Kurtosis        6.067697

Exhibit  3.19  Bank  Wages  (Section  3.4.4) 


Scatter  diagrams  of  forecasted  salaries  against  actual  salaries,  both  in  logarithms  ((a)  and  (c)), 
and  histograms  of  forecast  errors  ((b)  and  (d)),  for  employees  in  job  category  2  ((a)  and  (b), 
forecasts  obtained  from  model  estimated  for  the  data  of  employees  in  job  categories  1  and  3) 
and  for  employees  in  job  category  3  ((c)  and  (d),  forecasts  obtained  from  model  estimated  for 
the  data  of  employees  in  job  categories  1  and  2).  The  diagrams  indicate  that  the  salaries  in  a  job 
category  cannot  be  well  predicted  from  the  salaries  in  the  other  two  job  categories.  In  terms  of 
the  Chow  forecast  test,  the  prediction  errors  in  (a)  and  (b)  are  acceptable,  whereas  those  in  (c) 
and  (d)  are  not. 


downward bias, the forecast errors are small enough for the null hypothesis not to be rejected. The mean squared error of the forecasts (that is, the sum of the squared bias and the variance) is $(0.1286)^2 + (0.1270)^2 = 0.0327$ (see Exhibit 3.19 (b)), whereas the estimated variance of the disturbances is $s^2 = (0.1774)^2 = 0.0315$ (see Panel 2 in Exhibit 3.18). The forecast test is based on the magnitude of the forecast errors, and these are of the same order as the random variation $s^2$ on the estimation sample. This explains why the Chow test does not reject the hypothesis that custodial salaries can be predicted from the model estimated on the basis of wage data for jobs in administration and management.
(iii) Forecast of salaries for management jobs

As a second test, we predict the salaries for the eighty-four employees with management positions from the model estimated for administrative and custodial jobs (job categories 1 and 2). The regression results based on the 390 observations in job categories 1 and 2 are shown in Panel 3 of Exhibit 3.18. The corresponding Chow forecast test (3.58) can be computed from the results in Panels 1 and 3 of Exhibit 3.18:

$$F = \frac{(e_R'e_R - e_1'e_1)/g}{e_1'e_1/(n - k)} = \frac{(14.6275 - 10.0585)/84}{10.0585/(390 - 5)} \approx 2.082.$$

The P-value of the corresponding $F(84, 385)$ distribution is rounded to $P = 0.0000$, so that the predictions are not accurate. That is, the salaries in job category 3 cannot be predicted well in this case.

The scatter of eighty-four points of the actual and predicted salaries is shown in Exhibit 3.19 (c), and the histogram of the forecast errors in Exhibit 3.19 (d). The values are again mostly below the 45° line, so that salaries in this category are higher than would be expected (on the basis of education, begin salary, gender, and minority) for categories 1 and 2. The standard error of the regression over the 390 individuals in categories 1 and 2 is $s = 0.1616$ (see Panel 3 in Exhibit 3.18), whereas the root mean squared forecast error over the eighty-four individuals in category 3 can be computed from Exhibit 3.19 (d) as $((0.2006)^2 + (0.1997)^2)^{1/2} = 0.2831$. So the forecast errors are much larger than the usual random variation in the estimation sample. Stated otherwise, people with management positions earn on average more than people with administrative or custodial jobs for a given level of education, begin salary, gender, and minority.

(iv)  Comparison  of  the  two  forecasts 

Comparing once more Exhibit 3.19 (a) and (c), at first sight the predictive quality seems to be comparable in both cases. Note, however, that the vertical scales differ in the two scatter diagrams. Further, (a) contains far fewer observations than (c) (27 and 84 respectively). Forecast errors become more significant if they occur for a larger number of observations.
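The two Chow forecast statistics of this illustration can be recomputed from the sums of squared residuals in Exhibit 3.18. The Python sketch below does so; the helper names are hypothetical, and the inputs are the rounded values quoted in the text, so the results agree with the reported ones only up to rounding. It also compares the root mean squared forecast error with the standard error of regression $s$ in each case.

```python
import math

def chow_forecast_f(ssr_full, ssr_est, g, n, k):
    """Chow forecast statistic (3.58); n is the size of the estimation sample."""
    return ((ssr_full - ssr_est) / g) / (ssr_est / (n - k))

def root_mse(mean_error, sd_error):
    """Root mean squared forecast error from the bias and the standard deviation."""
    return math.sqrt(mean_error ** 2 + sd_error ** 2)

# (ii) Custodial jobs predicted from job categories 1 and 3 (Panels 1 and 2).
f_custodial = chow_forecast_f(14.6275, 13.9155, g=27, n=447, k=5)
rmse_custodial = root_mse(0.1286, 0.1270)   # Exhibit 3.19 (b)
s_custodial = 0.1774                        # S.E. of regression, Panel 2

# (iii) Management jobs predicted from job categories 1 and 2 (Panels 1 and 3).
f_management = chow_forecast_f(14.6275, 10.0585, g=84, n=390, k=5)
rmse_management = root_mse(0.2006, 0.1997)  # values quoted in the text for Exhibit 3.19 (d)
s_management = 0.1616                       # S.E. of regression, Panel 3

print(round(f_custodial, 3), round(rmse_custodial / s_custodial, 2))
print(round(f_management, 3), round(rmse_management / s_management, 2))
```

A ratio of root mean squared forecast error to $s$ close to one, as in the custodial case, indicates forecast errors of the same order as the in-sample variation; a ratio well above one, as in the management case, goes together with a large value of the test statistic.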
Summary, further reading, and keywords


SUMMARY 
tory variable. The least squares coefficients measure the direct effect of an explanatory variable on the dependent variable after neutralizing for the indirect effects that run via the other explanatory variables. These estimated effects therefore depend on the set of all explanatory variables included in the model. We paid particular attention to the question of which explanatory variables should be included in the model. For reasons of efficiency it is better to exclude variables that have only a marginal effect. The statistical properties of least squares were derived under a number of assumptions on the data generating process. Under these assumptions, the F-test can be used to test for the individual and joint significance of explanatory variables.


FURTHER  READING 

In  our  analysis  we  made  intensive  use  of  matrix  methods.  We  give  some  references 
Exercises


THEORY  QUESTIONS 

3.1 (Section 3.1.2)

In this exercise we study the derivatives of (3.6) and prove the result in (3.7). For convenience, we write $X'y = p$ (a $k \times 1$ vector) and $X'X = Q$ (a $k \times k$ matrix), so that we have to minimize the function $f(b) = y'y - p'b - b'p + b'Qb = y'y - 2b'p + b'Qb$. Check every detail of the following argument.

a. Let $b$ increase to $b + h$, where we may choose the elements of the $k \times 1$ vector $h$ as small as we like. Then $f(b + h) = f(b) + h'(-2p + (Q' + Q)b) + h'Qh$.

b. This result can be interpreted as a Taylor expansion. If the elements of $h$ are sufficiently small, the last term can be neglected, and the central term is a linear expression containing the $k \times 1$ vector of first order derivatives $\frac{\partial f}{\partial b} = -2p + (Q' + Q)b$. There are $k$ first order derivatives and we follow the convention to arrange them in a column vector.

c. If we apply this to (3.6), this shows that $\frac{\partial S}{\partial b} = -2X'y + 2X'Xb$.
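The expansion in part a can be verified numerically before it is proved. The Python sketch below uses a made-up $2 \times 2$ example; the values of $y'y$, $p$, $Q$, $b$, and $h$ and the function names are all hypothetical. It checks that $f(b + h) - f(b) - h'(-2p + (Q' + Q)b)$ equals $h'Qh$, and that the vector of first order derivatives vanishes at $b = Q^{-1}p$ when $Q$ is symmetric.

```python
def f(yy, p, Q, b):
    """f(b) = y'y - 2 b'p + b'Qb, with vectors as lists and Q as a list of rows."""
    k = len(b)
    linear = sum(b[i] * p[i] for i in range(k))
    quadratic = sum(b[i] * Q[i][j] * b[j] for i in range(k) for j in range(k))
    return yy - 2.0 * linear + quadratic

def gradient(p, Q, b):
    """Vector of first order derivatives -2p + (Q' + Q) b."""
    k = len(b)
    return [-2.0 * p[i] + sum((Q[j][i] + Q[i][j]) * b[j] for j in range(k))
            for i in range(k)]

# Hypothetical example values.
yy = 10.0
p = [1.0, 2.0]
Q = [[2.0, 1.0], [1.0, 3.0]]
b = [0.5, -1.0]
h = [0.01, -0.02]

b_plus_h = [bi + hi for bi, hi in zip(b, h)]
grad = gradient(p, Q, b)
linear_term = sum(hi * gi for hi, gi in zip(h, grad))
remainder = sum(h[i] * Q[i][j] * h[j] for i in range(2) for j in range(2))

lhs = f(yy, p, Q, b_plus_h)
rhs = f(yy, p, Q, b) + linear_term + remainder

# For this symmetric Q, the solution of Qb = p is b = (0.2, 0.6).
grad_at_minimum = gradient(p, Q, [0.2, 0.6])
print(lhs, rhs, grad_at_minimum)
```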

3.2 (Section 3.1.2)

In this exercise we prove the result in (3.10). The vector of first order derivatives in (3.7) contains one term that depends on $b$. For convenience we write it as $Qb$ and we partition the $k \times k$ matrix $Q = 2X'X$ into its columns as $Q = (q_1 \; q_2 \; \cdots \; q_k)$. Verify each step in the following argument.

a. $Qb$ can be written as $Qb = q_1b_1 + q_2b_2 + \cdots + q_kb_k$.

b. The derivatives of the elements of $Qb$ with respect to the scalar $b_i$ can be written as a column $q_i$. To write all derivatives for $i = 1, \ldots, k$ in one formula we follow the convention to write them as a ‘row of columns’, that is, we group them into a matrix, so that $\frac{\partial (Qb)}{\partial b'} = Q$ (note the prime in the left-hand denominator; this indicates that the separate derivatives are arranged as a row).

c. With the same conventions we get $\frac{\partial^2 S}{\partial b \, \partial b'} = Q$ for the Hessian.

d. Let $X$ be an $n \times k$ matrix with rank $k$; then prove that the $k \times k$ matrix $X'X$ is positive definite.

3.3 (Section 3.1.2)

The following steps show that the least squares estimator $b = (X'X)^{-1}X'y$ minimizes (3.6) without using the first and second order derivatives. In this exercise $b_*$ denotes any $k \times 1$ vector.

a. Let $b_* = (X'X)^{-1}X'y + d$; then show that $y - Xb_* = e - Xd$, where $e$ is a vector of constants that does not depend on the choice of $d$.

b. Show that $S(b_*) = e'e + (Xd)'(Xd)$ and that the minimum of this expression is attained if $Xd = 0$.

c. Derive the condition for uniqueness of this minimum and show that the minimum is then given by $d = 0$.

3.4 (Section 3.1.4)

a. In the model $y = X\beta + \varepsilon$, the normal equations are given by $X'Xb = X'y$, the least squares estimates by $b = (X'X)^{-1}X'y$, and the variance by $\mathrm{var}(b) = \sigma^2(X'X)^{-1}$. Work these three formulas out for the special case of the simple regression model $y_i = \alpha + \beta x_i + \varepsilon_i$ and prove that these results are respectively equal to the normal equations, the estimates $a$ and $b$, and the variances of $a$ and $b$ obtained in Sections 2.1.2 and 2.2.4.

b. Suppose that the $k$ random variables $y, x_2, x_3, \ldots, x_k$ are jointly normally distributed with mean $\mu$ and (non-singular) covariance matrix $\Sigma$. Let the observations be obtained by a random sample of size $n$ from this distribution $N(\mu, \Sigma)$. Define the random variable $y_c = y \mid \{x_2, \ldots, x_k\}$, that is, $y$ conditional on the values of $x_2, \ldots, x_k$. Show that the $n$ observations $y_c$ satisfy Assumptions 1-7 of Section 3.1.4.
3.5 (Section 3.1.5)

In some software packages the user is asked to specify the variable to be explained and the explanatory variables, while an intercept is added automatically. Now suppose that you wish to compute the least squares estimates $b$ in a regression of the type $y = X\beta + \varepsilon$ where the $n \times k$ matrix $X$ does not contain an ‘intercept column’ consisting of unit elements. Define


where  the  f  columns,  consisting  of  unit  elements 
only,  are  added  by  the  computer  package  and  the 
user  specifies  the  other  data. 

a. Prove that the least squares estimator obtained by regressing $y_*$ on $X_*$ gives the desired result.

b. Prove that the standard errors of the regression coefficients of this regression must be corrected by a factor $\sqrt{(2n - k - 1)/(n - k)}$.

3.6 (Section 3.1.6)

Suppose we wish to explain a variable $y$ and that the number of possible explanatory variables is so large that it is tempting to take a subset. In such a situation some researchers apply the so-called Theil criterion and maximize the adjusted $R^2$ defined by $\bar{R}^2 = 1 - \frac{n-1}{n-k}(1 - R^2)$, where $n$ is the number of observations and $k$ the number of explanatory variables.

a. Prove that $R^2$ never decreases by including an additional regressor in the model.

b. Prove that the Theil criterion is equivalent to minimizing $s$, the standard error of regression.

c. Prove that the Theil criterion implies that an explanatory variable $x_j$ will be maintained if and only if the F-test statistic for the null hypothesis $\beta_j = 0$ is larger than one.

d. Show that the size (significance level) of such a test is larger than 0.05.
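As a numerical illustration of the definition of the adjusted $R^2$ in this exercise, the Python sketch below applies it to the values reported in Panel 1 of Exhibit 3.18. The function name is hypothetical, and $k = 5$ counts the constant term as one of the regressors, which is an assumption about the convention that appears consistent with the reported value.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared: 1 - (n - 1)/(n - k) * (1 - R2)."""
    return 1.0 - (n - 1) / (n - k) * (1.0 - r2)

# Panel 1 of Exhibit 3.18: R-squared 0.804117 with n = 474 observations.
adj = adjusted_r2(0.804117, n=474, k=5)
print(round(adj, 6))
```

The result can be compared with the 'Adjusted R-squared' entry in the same panel.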

3.7 (Sections 3.1.5, 3.1.6, 3.2.4, 3.4.1, 3.4.3)

Some of the following questions and arguments were mentioned in this chapter.

a. Prove the result stated in Section 3.1.5 that $h_i > 0$ if the $n \times k$ matrix $X$ contains a column of unit elements and rank$(X) = k$.

b. Prove that $R^2$ (in the model with constant term) is the square of the sample correlation coefficient between $y$ and $\hat{y} = Xb$.

c. If a regression model contains no constant term so that the matrix $X$ contains no column of ones, then show that $1 - (SSR/SST)$ (and hence $R^2$ when it is computed in this way) may be negative.

d. Let $y = X_1\beta_1 + X_2\beta_2 + \varepsilon$ and let $\beta_1$ be estimated by regressing $y$ on $X_1$ alone (the ‘omitted variables’ case of Section 3.2.3). Show that $\mathrm{var}(b_R) \le \mathrm{var}(b_1)$ in the sense that $\mathrm{var}(b_1) - \mathrm{var}(b_R)$ is a positive semidefinite matrix. When are the two variances equal?

e. Show that the F-test for a single restriction $\beta_j = 0$ is equal to the square of the t-value of $b_j$. Show also that both tests lead to the same conclusion, irrespective of the chosen significance level.

f*. Consider the expression (3.49) of the F-test in terms of the random variables $b_2'X_2'M_1X_2b_2$ and $e'e$. Prove that, under the null hypothesis that $\beta_2 = 0$, these two random variables are independently distributed as $\chi^2(g)$ and $\chi^2(n - k)$ respectively by showing that (i) they can be expressed as $\varepsilon'Q_1\varepsilon$ and $\varepsilon'Q_2\varepsilon$, with (ii) $Q_1 = M_1 - M$ and $Q_2 = M$, where $M$ is the M-matrix corresponding to $X$ and $M_1$ is the M-matrix corresponding to $X_1$, so that (iii) $Q_1$ is idempotent with rank $g$ and $Q_2$ is idempotent with rank $(n - k)$, and (iv) $Q_1Q_2 = 0$.

g. In Section 3.4 we considered the prediction of $y_2$ for given values of $X_2$ under the assumptions that $y_1 = X_1\beta + \varepsilon_1$ and $y_2 = X_2\beta + \varepsilon_2$ where $E[\varepsilon_1] = 0$, $E[\varepsilon_2] = 0$, $E[\varepsilon_1\varepsilon_1'] = \sigma^2I$, $E[\varepsilon_2\varepsilon_2'] = \sigma^2I$, and $E[\varepsilon_1\varepsilon_2'] = 0$. Prove that under Assumptions 1-6 the predictor $X_2b$ with $b = (X_1'X_1)^{-1}X_1'y_1$ is best linear unbiased. That is, among all predictors of the form $\hat{y}_2 = Ly_1$ (with $L$ a given matrix) with the property that $E[\hat{y}_2 - y_2] = 0$, it minimizes the variance of the forecast error $\hat{y}_2 - y_2$.

h. Using the notation introduced in Section 3.4.3, show that a $(1 - \alpha)$ prediction interval for $y_{2j}$ is given by $X_{2j}'b \pm cs\sqrt{d_{jj}}$.

3.8  (“©  Section  3.4.1) 

Consider  the  model  y  =  Xp  +  e  with  the  null  hy¬ 
pothesis  that  Rp  =  r  where  £  is  a  given  g  x  k  matrix 
of  rank  g  and  r  is  a  given  gx  1  vector.  Use  the 
following  steps  to  show  that  the  expression  (3.54) 
for  the  £-test  can  be  written  in  terms  of  residual 
sums  of  squares  as  in  (3.50). 

a. The restricted least squares estimator $b_R$
minimizes the sum of squares $(y - X\beta)'(y - X\beta)$
under the restriction that $R\beta = r$. Show that
$b_R = b - A(Rb - r)$, where $b$ is the unrestricted
least squares estimator and $A = (X'X)^{-1}R'[R(X'X)^{-1}R']^{-1}$.

b. Let $e = y - Xb$ and $e_R = y - Xb_R$; then show that
$e_R'e_R = e'e + (Rb - r)'[R(X'X)^{-1}R']^{-1}(Rb - r)$.

c. Show that the F-test in (3.54) can be written as in
(3.50).

d. In Section 3.4.2 we tested the null hypothesis that
$\beta_4 + \beta_5 = 0$ in the model with $k = 5$ explanatory
variables. Describe a method to determine the
restricted sum of squared residuals $e_R'e_R$ in this
case.

3.9 (Section 3.2.5)

This exercise serves to clarify a remark on standard
errors in partial regressions that was made in
Example 3.3 (p. 150). We use the notation of
Section 3.2.5, in particular the estimated regressions

(1) $y = X_1b_1 + X_2b_2 + e$, and

(2) $M_2y = (M_2X_1)b_* + e_*$

in the result of Frisch-Waugh. Here $X_1$ and $M_2X_1$
are $n \times (k-g)$ matrices and $X_2$ is an $n \times g$ matrix.

a. Prove that $\mathrm{var}(b_1) = \mathrm{var}(b_*) = \sigma^2(X_1'M_2X_1)^{-1}$.

b. Derive expressions for the estimated variance $s^2$
in regression (1) and $s_*^2$ in regression (2), both in
terms of $e'e$.

c. Prove that the standard errors of the coefficients
$b_1$ in (1) can be obtained by multiplying the
standard errors of the coefficients $b_*$ in (2) by
the factor $\sqrt{(n-k+g)/(n-k)}$.

d.  Check  this  result  by  considering  the  standard 
errors  of  the  variable  education  in  the  second 
regression  in  Exhibit  3.7  and  the  last  regression 
in Exhibit 3.10. (These values are rounded; a
more precise result is obtained when higher precision
values from a regression package are used.)

e. Derive the relation between the t-values of (1)
and (2).
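
The relation in parts a to c of Exercise 3.9 can be checked numerically before attempting the proofs. The following Python sketch is only an illustration on simulated data: the sample size, the coefficient values, and the seed are arbitrary assumptions and are not taken from the bank wage data. It runs regression (1) and regression (2) and compares coefficients, residuals, and standard errors.

```python
import numpy as np

rng = np.random.default_rng(0)      # arbitrary seed (assumption)
n, k, g = 50, 4, 2                  # hypothetical sizes: X1 is n x (k-g), X2 is n x g

# Simulated regressors: X2 holds a constant and one variable, X1 is correlated with X2.
X2 = np.column_stack([np.ones(n), rng.normal(size=n)])
X1 = rng.normal(size=(n, k - g)) + 0.5 * X2[:, [1]]
y = X1 @ np.array([1.0, -0.5]) + X2 @ np.array([2.0, 0.3]) + rng.normal(size=n)


def ols(X, y):
    """Return coefficients, residuals and conventional standard errors."""
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    e = y - X @ b
    s2 = e @ e / (X.shape[0] - X.shape[1])   # degrees of freedom: rows minus columns
    return b, e, np.sqrt(s2 * np.diag(XtX_inv))


# Regression (1): y on X1 and X2 (n - k degrees of freedom).
b_full, e_full, se_full = ols(np.column_stack([X1, X2]), y)

# Regression (2): M2 y on M2 X1, with M2 = I - X2 (X2'X2)^{-1} X2'
# (a package would use n - (k - g) degrees of freedom here).
M2 = np.eye(n) - X2 @ np.linalg.inv(X2.T @ X2) @ X2.T
b_star, e_star, se_star = ols(M2 @ X1, M2 @ y)

factor = np.sqrt((n - k + g) / (n - k))
print("coefficients:", b_full[: k - g], b_star)
print("standard errors:", se_full[: k - g], se_star * factor)
```

The two printed pairs should agree up to rounding error, because both regressions share the same residuals and differ only in the degrees of freedom used for the estimated variance.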

3.10 (Section 3.4.1)

In Section 1.4.2 we mentioned the situation of two
independent random samples, one of size $n_1$ from
$N(\mu_1, \sigma^2)$ and a second one of size $n_2$ from
$N(\mu_2, \sigma^2)$. We want to test the null hypothesis
$H_0: \mu_1 = \mu_2$ against the alternative $H_1: \mu_1 \neq \mu_2$.
The pooled t-test is based on the difference between
the sample means $\bar{y}_1$ and $\bar{y}_2$ of the two
sub-samples. Let $e_1'e_1 = \sum_{i=1}^{n_1}(y_i - \bar{y}_1)^2$ and
$e_2'e_2 = \sum_{i=n_1+1}^{n_1+n_2}(y_i - \bar{y}_2)^2$ be the total sum of squares
in the first and second sub-sample respectively; then
the pooled estimator of the variance is defined by
$s_p^2 = (e_1'e_1 + e_2'e_2)/(n_1 + n_2 - 2)$ and the pooled
t-test is defined by

$t_p = \sqrt{\frac{n_1 n_2}{n_1 + n_2}}\,\frac{\bar{y}_2 - \bar{y}_1}{s_p}$.

a. Formulate the testing problem of $\mu_1 = \mu_2$ against
$\mu_1 \neq \mu_2$ in terms of a parameter restriction in a
multivariate regression model (with parameters
$\mu_1$ and $\mu_2$).

b. Derive the F-test for $H_0: \mu_1 = \mu_2$ in the form
(3.50).

c. Prove that $t_p^2$ is equal to the F-test in b and that $t_p$
follows the $t(n_1 + n_2 - 2)$ distribution if the null
hypothesis of equal means holds true.

d. In Example 1.12 (p. 62) we considered the
FGPA scores of $n_1 = 373$ male students and
$n_2 = 236$ female students. Use the results
reported  in  Exhibit  1.6  to  perform  a  test  of  the 
null  hypothesis  of  equal  means  for  male  and 
female  students  against  the  alternative  that 
female  students  have  on  average  higher  scores 
than  male  students. 

3.11 (Section 3.4.3)

We consider the Chow forecast test (3.58) for the
case $g = 1$ of a single new observation $(x_{n+1}, y_{n+1})$.
The $n$ preceding observations are used in the model
$y_1 = X_1\beta + \varepsilon_1$ with least squares estimator $b$. We
assume that Assumptions 1-4 and 7 are satisfied
for the full sample $i = 1, \ldots, n+1$, and Assumptions
5 and 6 for the estimation sample
$i = 1, \ldots, n$, whereas for the $(n+1)$st observation
we write

$y_{n+1} = x_{n+1}'\beta + \gamma + \varepsilon_{n+1}$

with $\gamma$ an unknown scalar parameter. We consider
the null hypothesis that $\gamma = 0$ against the alternative
that $\gamma \neq 0$.

a. Prove that the least squares estimators of $\beta$ and $\gamma$
over the full sample $i = 1, \ldots, n+1$ are given
by $b$ and $c = y_{n+1} - x_{n+1}'b$. Show that the residual
for the $(n+1)$st observation is equal to
zero. Provide an intuitive explanation for this
result.

b. Derive the residual sum of squares over the full
sample $i = 1, \ldots, n+1$ under the alternative
hypothesis.

c. Derive the F-test for the hypothesis that $\gamma = 0$.


EMPIRICAL  AND  SIMULATION  QUESTIONS 

3.12 (Section 3.3.3)

In this simulation exercise we consider five variables
($y$, $z$, $x_1$, $x_2$, and $x_3$) that are generated as follows.
Let $n = 100$ and let $\varepsilon_i, \omega_i, \eta_i \sim \mathrm{NID}(0, 1)$ be independent
random samples from the standard normal
distribution, $i = 1, \ldots, n$. Define

$x_{1i} = 5 + \omega_i + 0.3\eta_i$
$x_{2i} = 10 + \omega_i$
$x_{3i} = 5 + \eta_i$
$y_i = x_{1i} + x_{2i} + \varepsilon_i$
$z_i = x_{2i} + x_{3i} + \varepsilon_i$


a. What is the correlation between $x_1$ and $x_3$? And
what is the correlation between $x_2$ and $x_3$?

b. Perform the regression of $y$ on a constant, $x_1$ and
$x_2$. Compute the regression coefficients and their
t-values. Comment on the outcomes.

c. Answer the questions of b for the regression of $z$
on a constant, $x_2$ and $x_3$.

d. Perform also regressions of $y$ on a constant and
$x_1$, and of $z$ on a constant and $x_3$. Discuss the
differences that arise between these two cases.
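
As a starting point for Exercise 3.12, the data generating process can be sketched in Python as follows. This is a minimal sketch and not a prescribed implementation: the seed is an arbitrary assumption, and the names `eps`, `omega`, and `eta` are simply labels for the three independent standard normal series. The regressions of parts b to d are left to the reader.

```python
import numpy as np

rng = np.random.default_rng(12345)   # arbitrary seed, used only for reproducibility
n = 100

# Three independent NID(0, 1) series of length n.
eps, omega, eta = rng.standard_normal((3, n))

x1 = 5 + omega + 0.3 * eta
x2 = 10 + omega
x3 = 5 + eta
y = x1 + x2 + eps
z = x2 + x3 + eps

# Sample correlation matrix of (x1, x2, x3); rows and columns follow that order.
corr = np.corrcoef([x1, x2, x3])
print(np.round(corr, 2))
```

Because $x_1$ and $x_2$ share the component $\omega_i$, their sample correlation is high, whereas $x_2$ and $x_3$ are built from independent series.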

3.13 (Section 3.4.1)

In Section 3.4.2 we tested four different
hypotheses, that is, (i) $\beta_5 = 0$, (ii) $\beta_2 = \beta_3 = \beta_4 = \beta_5 = 0$,
(iii) $\beta_4 = \beta_5 = 0$, and
(iv) $\beta_4 + \beta_5 = 0$. As data set we considered the
data on all 474 employees (see Exhibit 3.16). Use a
significance level of 5 per cent in all tests below.

a.  Test  these  four  hypotheses  also  for  the  subset  of 
employees  working  in  management  (job  category 
3),  using  the  results  in  the  last  two  columns  in 
Exhibit  3.16. 

b. Now consider the hypothesis (iii) that gender and
minority have no effect on salary for employees
in management. We mention that of the eighty-four
employees in management, seventy are
male non-minority, ten are female non-minority,
four are male minority, and no one is female
minority. Discuss the relevance of this information
with respect to the power of the test for
hypothesis (iii).

c. Finally consider the subset of employees with
custodial jobs (job category 2, where all employees
are male). Use the results in Exhibit 3.16 to
test the hypothesis that $\beta_5 = 0$. Test also the
hypothesis that $\beta_2 = \beta_3 = \beta_5 = 0$.

3.14 (Sections 3.2.2, 3.3.3)

In this exercise we consider the data set
on student learning of Example 1.1 (p. 12)
for 609 students. The dependent variable
($y$) is the FGPA score of a student, and the explanatory
variables are $x_1$ (constant term), $x_2$ (SATM
score), $x_3$ (SATV score), and $x_4$ (FEM, with $x_4 = 1$
for females and $x_4 = 0$ for males).

a. Compute the 4 x 4 correlation matrix for the
variables ($y$, $x_2$, $x_3$, $x_4$).

b. Estimate a model for FGPA in terms of SATV by
regressing $y$ on $x_1$ and $x_3$. Estimate also a model
by regressing $y$ on $x_1$, $x_2$, $x_3$, and $x_4$.

c.  Comment  on  the  differences  between  the  two 
models  in  b  for  the  effect  of  SATV  on  FGPA. 

d. Investigate the presence of collinearity between
the explanatory variables by computing $R_j^2$ in
(3.47) and the square root of the variance inflation
factors, $1/\sqrt{1 - R_j^2}$, for $j = 2, 3, 4$.

3.15 (Section 3.4.1)

In  this  exercise  we  consider  production 
data  for  the  year  1994  of  n  =  26  US  firms 
in  the  sector  of  primary  metal  industries 
(SIC33). The data are taken from E. J. Bartelsman
and W. Gray, National Bureau of Economic Research,
NBER Technical Working Paper 205, 1996.
For  each  firm,  values  are  given  of  production  (Y, 
value  added  in  millions  of  dollars),  labour  (L,  total 
payroll  in  millions  of  dollars),  and  capital  ( K ,  real 
capital  stock  in  millions  of  1987  dollars).  A  log-linear 
production  function  is  estimated  with  the  following 
result (standard errors are in parentheses).

$\log(Y) = 0.701 + 0.756\log(L) + 0.242\log(K) + e$
          (0.415)  (0.091)         (0.110)

The model is also estimated under two alternative
restrictions, the first with equal coefficients for
$\log(L)$ and $\log(K)$ and the second with the sum of
the coefficients of $\log(L)$ and $\log(K)$ equal to one
(‘constant returns to scale’). For this purpose the
following two regressions are performed.

$\log(Y) = 0.010 + 0.524(\log(L) + \log(K)) + e_1$
          (0.358)  (0.026)

$\log(Y) - \log(K) = 0.686 + 0.756(\log(L) - \log(K)) + e_2$
                    (0.132)  (0.089)

The residual sums of squares are respectively $e'e = 1.825544$,
$e_1'e_1 = 2.371989$, and $e_2'e_2 = 1.825652$,
and the $R^2$ are respectively equal to $R^2 = 0.956888$,
$R^2 = 0.943984$, and $R^2 = 0.751397$. In the
following tests use a significance level of 5%.

a. Test for the individual significance of $\log(L)$ and
$\log(K)$ in the first regression. Test also for the
joint significance of these two variables.

b. Test the restriction of equal coefficients by means
of an F-test based on the residual sums of
squares.

c. Test this restriction also by means of the $R^2$.

d. Test the restriction of constant returns to scale
also in two ways, one with the F-test based on the
residual sums of squares and the other with the
F-test based on the $R^2$.

e. Explain why the outcomes of b and c are the
same but the two outcomes in d are different.
Which of the two tests in d is the correct one?


3.16 (Section 3.2.5)

Consider the data on bank wages of the
example in Section 3.1.7. To test for
the possible effect of gender on wage,
someone proposes to estimate the model
$y = \beta_1 + \beta_4x_4 + \varepsilon$, where $y$ is the yearly wage (in
logarithms) and $x_4$ is the variable gender (with
$x_4 = 0$ for females and $x_4 = 1$ for males). As an
alternative we consider the model with $x_2$ (education)
as an additional explanatory variable.

a. Use the data to perform the two regressions.

b. Comment on the differences between the conclusions
that could be drawn (without further thinking)
from each of these two regressions.

c. Draw a partial regression scatter plot (with regression
line) for salary (in logarithms) against
gender after correction for the variable education
(see Case 3 in Section 3.2.5). Draw also a scatter
plot (with regression line) for the original (uncorrected)
data on salary (in logarithms) and gender.
Discuss how these plots help in clarifying the
differences in b.

d. Check the results on regression coefficients and
residuals in the result of Frisch-Waugh (3.39) for
these data, where $X_1$ refers to the variable $x_4$,
and $X_2$ refers to the constant term and the variable
$x_2$.

XM301BWA 


3.17 (Section 3.4.3)

In  this  exercise  we  consider  data  on 
weekly  coffee  sales  of  a  certain  brand  of 
coffee.  These  data  come  from  the  same 
marketing  experiment  as  discussed  in  Example  2.3 
(p.  78),  but  for  another  brand  of  coffee  and  for 
another  selection  of  weeks.  The  data  provide  for 
n  =  18  weeks  the  values  of  the  coffee  sales  in  that 
week  (Q,  in  units),  the  applied  deal  rate  (D  =  1  for 
the  usual  price,  D  =  1.05  in  weeks  with  5%  price 
reduction,  and  D  =  1.15  in  weeks  with  15%  price 
reduction),  and  advertisement  (A  =  1  in  weeks  with 
advertisement,  A  =  0  otherwise).  We  postulate  the 
model 


$\log(Q) = \beta_1 + \beta_2\log(D) + \beta_3A + \varepsilon$.

For all tests below use a significance level of 5%.

a. Test whether advertisement has a significant
effect on sales, both by a t-test and by an F-test.

b. Test the null hypothesis that $\beta_2 = 1$ against the
alternative that $\beta_2 > 1$.

c. Construct 95% interval estimates for the parameters
$\beta_2$ and $\beta_3$.

d. Estimate the model using only observations in
weeks without advertisement. Test whether this
model produces acceptable forecasts for the sales
(in logarithms) in the weeks with advertisement.
Note: take special care of the fact that the estimated
model cannot predict the effect of advertisement.

e. Make two scatter plots, one of the actual values
of $\log(Q)$ against the fitted values of d for the
twelve observations in the estimation sample,
and a second one of $\log(Q)$ against the predicted
values for the six observations in the prediction
sample. Relate these graphs to your conclusions
in d.

3.18 (Section 3.2.5)

In this exercise we consider yearly data
(from 1970 to 1999) related to motor gasoline
consumption in the USA. The data
are taken from different sources (see the table). Here
‘rp’ refers to data in the Economic Report of
the President (see w3.access.gpo.gov), ‘ecocb’ to
data of the Census Bureau, and ‘ecode’ to data of
the Department of Energy (see www.economagic.com).
The price indices are defined so that the average
value over the years 1982-4 is equal to 100.
We define the variables $y = \log(SGAS/PGAS)$,
$x_2 = \log(INC/PALL)$, $x_3 = \log(PGAS/PALL)$,
$x_4 = \log(PPUB/PALL)$, $x_5 = \log(PNCAR/PALL)$,
and $x_6 = \log(PUCAR/PALL)$. We are interested in
the price elasticity of gasoline consumption, that is,
the marginal relative increase in sold quantity due to
a marginal relative price increase.


Variable  Definition                                     Units               Source
SGAS      Retail sales gasoline service stations         10^6 dollars        ecocb
PGAS      Motor gasoline retail price, US city average   cts/gallon          ecode
INC       Nominal personal disposable income             10^9 dollars        rp
PALL      Consumer price index                           (1982-4)/3 = 100    rp
PPUB      Consumer price index of public transport       idem                rp
PNCAR     Consumer price index of new cars               idem                rp
PUCAR     Consumer price index of used cars              idem                rp

a. Estimate this price elasticity by regressing
$\log(SGAS)$ on a constant and $\log(PGAS)$. Comment
on the outcome, and explain why this outcome
is misleading.

b. Estimate the price elasticity now by regressing $y$
on a constant and $\log(PGAS)$. Explain the precise
relation with the results in a. Why is this outcome
still misleading?

c. Now estimate the price elasticity by regressing $y$
on a constant and the variables $x_2$ and $x_3$. Provide
a motivation for this choice of explained and
explanatory variables and comment on the outcomes.

d. If $y$ is regressed on a constant and the variable $x_3$
then the estimated elasticity is more negative
than in c. Check this result and give an explanation
in terms of partial regressions. Use the fact
that, in the period 1970-99, real income has
mostly gone up and the price of gasoline (as
compared with other prices) has mostly gone
down.

e. Perform the partial regressions needed to remove
the effect of income ($x_2$) on the consumption ($y$)
and on the relative price ($x_3$). Make a partial
regression scatter plot of the ‘cleaned’ variables
and check the validity of the result of Frisch-Waugh
in this case.

f. Estimate the price elasticity by regressing $y$ on a
constant and the variables $x_2$, $x_3$, $x_4$, $x_5$, and $x_6$.
Comment on the outcomes and compare them
with the ones in c.

g.  Transform  the  four  price  indices  (PALL,  PPUB, 
PNCAR,  and  PUCAR)  so  that  they  all  have  the 
value  100  in  1970.  Perform  the  regression  of  f 
for  the  transformed  data  (taking  logarithms 
again)  and  compare  the  outcomes  with  the  ones 
in  f.  Which  regression  statistics  remain  the  same, 
and  which  ones  have  changed?  Explain  these 
results. 

3.19 (Sections 3.4.1, 3.4.3)

We consider the same data on motor gasoline
consumption as in Exercise 3.18
and we use the same notation as introduced
there. For all tests below, compute sums of
squared residuals of appropriate regressions, determine
the degrees of freedom of the test statistic, and
use a significance level of 5%.

a. Regress $y$ on a constant and the variables $x_2$, $x_3$,
$x_4$, $x_5$, and $x_6$. Test for the joint significance of
the prices of new and used cars.

b.  Regress  y  on  a  constant  and  the  four  explanatory 
variables  log  (PGAS),  log  (PALL),  log  (INC), 
and  log  (PPUB).  Use  the  results  to  construct  a 
95% interval estimate for the price elasticity of
gasoline consumption.

c. Test the null hypothesis that the sum of the coefficients
of the four regressors in the model in b
(except the constant) is equal to zero. Explain
why this restriction is of interest by relating
this regression model to the restricted regression
in a.

d. Show that the following null hypothesis is not
rejected: the sum of the coefficients of $\log(PALL)$,
$\log(INC)$, and $\log(PPUB)$ in the model of b is
equal to zero. Show that the restricted model has
regressors $\log(PGAS)$, $x_2$ and $x_4$ (and a constant
term), and estimate this model.

e. Use the model of d (with the constant,
$\log(PGAS)$, $x_2$ and $x_4$ as regressors) to construct
a 95% interval estimate for the price elasticity of
gasoline consumption. Compare this with the
result in b and comment.

f.  Search  the  Internet  to  find  the  most  recent  year 
with  values  of  the  variables  SGAS,  PGAS, 
PALL,  INC,  and  PPUB  (make  sure  to  use  the 
same  units  as  the  ones  mentioned  in  Exercise 
3.18).  Use  the  models  in  b  and  d  to  construct 
95%  forecast  intervals  of  y  =  log  ( SGAS/PGAS ) 
for  the  given  most  recent  values  of  the  regressors. 

g. Compare the most recent value of $y$ with the two
forecast intervals of part f. For the two models in
b and d, perform Chow forecast tests for the most
recent value of $y$.

Non-Linear Methods


In  the  previous  chapter,  the  finite  sample  statistical  properties  of  regression 
methods  were  derived  under  restrictive  assumptions  on  the  data  generating 
process.  In  this  chapter  we  describe  several  methods  that  can  be  applied  more 
generally. We consider models with stochastic explanatory variables,
non-normal disturbances, and non-linearities in the parameters. Some of these
models  can  be  estimated  by  (non-linear)  least  squares;  other  models  are 
better  estimated  by  maximum  likelihood  or  by  the  generalized  method  of 
moments.  In  most  cases  there  exists  no  closed-form  expression  for  the 
estimates,  so  that  numerical  methods  are  required.  Often  the  finite  sample 
statistical properties of the estimators cannot be derived analytically. An
approximation is obtained by asymptotic analysis, that is, by considering
the statistical properties if the sample size tends to infinity.

4.1 Asymptotic analysis

Uses  Section  1.3.3. 


4.1.1  Introduction 

Motivation  of  asymptotic  analysis  and  use  in  finite  samples 

In  the  previous  chapter  we  have  seen  that,  given  certain  assumptions  on  the 
data  generating  process,  we  can  derive  the  exact  distributional  properties  of 
estimators ($b$ and $s^2$) and of tests (for instance, t- and F-tests). However, these
assumptions  are  rather  strong  and  one  might  have  a  hard  time  finding 
practical  applications  where  all  these  assumptions  hold  exactly  true.  For 
example,  regressors  typically  do  not  tend  to  be  ‘fixed’  (as  we  do  not  often 
do  controlled  experiments),  but  they  are  often  stochastic  (as  we  rely  on 
empirical  data  that  are  for  some  part  affected  by  random  factors).  Also, 
regression  models  need  not  be  linear  in  the  parameters. 

An  interesting  question  now  is  whether  estimators  and  tests,  which  are  based 
on  the  same  principles  as  before,  still  make  sense  in  this  more  general  setting. 
Strictly  speaking,  if  one  or  several  of  the  standard  Assumptions  1-7  in  Section 
3.1.4  (p.  125-6)  are  violated,  then  we  do  not  know  the  statistical  properties  of 
the  estimators  and  tests  anymore.  A  useful  tool  to  obtain  understanding  of  the 
properties  and  tests  in  this  more  general  setting  is  to  pretend  that  we  can  obtain  a 
limitless  number  of  observations.  We  can  then  pose  the  question  how 
the  estimators  and  tests  would  behave  when  the  number  of  observations 
increases  without  limit.  This,  in  essence,  is  what  is  called  asymptotic  analysis. 
Of course, in practice our sample size is finite. However, the asymptotic properties
translate into results that hold true approximately in finite samples, provided
that  the  sample  size  is  large  enough.  That  is,  once  we  know  how  estimators  and 
tests  behave  for  a  limitless  number  of  observations,  we  also  get  an  approximate 
idea  of  how  they  perform  in  finite  samples  of  usual  size. 

Random  regressors  and  non-normal  disturbances 

As  before,  we  consider  the  linear  model 


$y = X\beta + \varepsilon$.    (4.1)


In the previous chapter we derived the statistical properties of the least
squares estimator under the seven assumptions listed in Section 3.1.4. In
this chapter we relax some of these assumptions. In particular, we consider
situations  where  the  explanatory  variables  are  random  (so  that  Assumption 
1  is  not  satisfied),  where  the  disturbances  are  not  normally  distributed  (so 
that  Assumption  7  is  violated),  or  where  the  model  is  not  linear  in  the 
parameters  (so  that  Assumption  6  is  violated).  Such  situations  often  occur 
in  practice  when  we  analyse  observed  economic  data.  In  this  section  we 
consider  the  properties  of  the  least  squares  estimator  when  Assumptions 
1  and  7  are  violated.  Non-linear  regression  models  are  discussed  in  Section 
4.2. 

Averaging  to  remove  randomness  and  to  obtain  normality, 
asymptotically 

The  general  idea  is  to  remove  randomness  and  non-normality  asymptotically 
by  taking  averages  of  the  observed  data.  In  Section  1.3.3  (p.  50)  we  discussed 
the  law  of  large  numbers,  which  states  that  the  (random)  sample  average 
converges  in  probability  to  the  (non-random)  population  mean,  and  the 
central  limit  theorem,  which  states  that  this  average  (properly  scaled) 
converges  in  distribution  to  a  normal  distribution.  That  is,  if  Assumptions 
1  and  7  are  violated,  then  under  appropriate  conditions  these  assumptions 
still hold true asymptotically, that is, if the sample size grows without limit
($n \to \infty$). The results of Chapter 3 then also hold true asymptotically, and
they can be taken as an approximation in large enough finite samples.
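
The effect of averaging can be made visible with a small simulation. The Python sketch below is an illustration under assumed settings (centred exponential draws, an arbitrary seed, and arbitrarily chosen sample sizes); it is not part of the bank wage example. For each sample size it reports the standard deviation of the sample average, which shrinks as the law of large numbers suggests, and the skewness of the average scaled by $\sqrt{n}$, which moves towards the value zero of the normal distribution as the central limit theorem suggests.

```python
import numpy as np

rng = np.random.default_rng(1)   # arbitrary seed (assumption)
reps = 5_000                     # number of simulated samples per sample size


def skewness(v):
    """Sample skewness: third central moment over the cubed standard deviation."""
    v = v - v.mean()
    return (v ** 3).mean() / (v ** 2).mean() ** 1.5


results = {}
for n in (1, 10, 100, 1000):
    # Centred exponential draws: mean 0, variance 1, skewness 2 (clearly non-normal).
    draws = rng.exponential(size=(reps, n)) - 1.0
    means = draws.mean(axis=1)           # one sample average per simulated sample
    scaled = np.sqrt(n) * means          # scaling used in the central limit theorem
    results[n] = (means.std(), skewness(scaled))
    print(n, round(results[n][0], 3), round(results[n][1], 3))
```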

Before  discussing  further  details  of  asymptotic  analysis,  we  give  an 
example  to  illustrate  that  Assumptions  1  and  7  are  often  violated  in  practice. 

Example  4.1:  Bank  Wages  (continued) 

As  an  illustration,  suppose  that  we  want  to  investigate  the  wage  structure  in 
the  US  banking  sector.  We  will  discuss  (i)  randomness  of  the  regressors  due  to 
sampling,  (ii)  measurement  errors,  and  (iii)  non-normality  of  the  disturbances. 

(i)  Sampling  as  a  source  of  randomness 

To estimate a wage equation for the US banking sector, we could use the data
of n = 474 employees of a US bank (see Section 2.1.4 and Exhibit 2.5 (a)
(p. 85), and Section 3.1.7 and Exhibit 3.5, Panel 3 (p. 132)). If we were to use
data  of  employees  of  another  bank,  this  would  of  course  give  other  values 
for  the  dependent  and  explanatory  variables.  That  is,  both  y  and  X  in  (4.1) 
are  obtained  by  sampling  from  the  full  population  of  employees  of  all  US 
banks.  This  means  that  both  y  and  X  are  random,  so  that  Assumption  1  is 
violated. 

To  illustrate  this  idea,  suppose  that  our  data  set  consisted  only  of  a  subset 
of  the  474  employees  considered  before.  We  show  the  results  for  two 
such sub-samples. Exhibit 4.1 contains three histograms of the explanatory
variable education (in (a), (c), and (e)) and three corresponding scatter
diagrams (in (b), (d), and (f)). The first data set consists of the full sample;
the other two are the result of a random selection of the employees in two
distinct groups of size 237 each. Clearly, the outcomes depend on the chosen
sample, and both y and X are random because of sampling.

Exhibit 4.1 Bank Wages (Example 4.1)

Histograms of variable education (EDUC) ((a), (c), and (e)), and scatter diagrams of salary (in
logarithms) against education ((b), (d), and (f)), for full sample (n = 474 ((a) and (b))), and for
two (complementary) random samples of size 237 ((c)-(f)).

(ii)  Measurement  errors 

Apart from sampling effects, the observed explanatory variables often provide
only partial information on the economic variables of interest. For
example,  the  measured  number  of  years  of  education  of  employees  does 
not  take  the  quality  of  the  education  into  account.  The  reported  data  contain 
measurement  errors,  in  the  sense  that  they  give  imperfect  information  on  the 
relevant  underlying  economic  variables. 

(iii)  Non-normality  of  disturbances 

As concerns Assumption 7, an indication of the distribution of the
disturbances may be obtained by considering the least squares residuals.
For the simple regression model of Section 2.1.4, where salaries (in logarithms)
are explained from education alone, the histogram of the residuals is
given in Exhibit 2.5 (b). This distribution is skewed and this may cast doubt
on the validity of Assumption 7.


4.1.2  Stochastic  regressors 

Interpretation  of  previous  results  for  stochastic  regressors 

One  way  to  deal  with  stochastic  regressors  is  to  interpret  the  results  that  are 
obtained  under  the  assumption  of  fixed  regressors  as  results  that  hold  true 
conditional  on  the  given  outcomes  of  the  regressors.  The  results  in  Chapters  2 
and  3,  which  were  obtained  under  Assumption  1  of  fixed  regressors,  carry 
over  to  the  case  of  stochastic  regressors,  provided  that  all  assumptions 
and  results  are  interpreted  conditional  on  the  given  values  of  the  regressors. 
To illustrate this idea, we consider the mean and variance of the least
squares estimator $b$. In Section 3.1.4 (p. 126) we showed that, under
Assumptions 1-6, $E[b] = \beta$ and $\mathrm{var}(b) = \sigma^2(X'X)^{-1}$. If the regressors in the
$n \times k$ matrix $X$ are stochastic, these results are not valid anymore. However,
suppose that we replace Assumption 2 that $E[\varepsilon] = 0$ and Assumptions 3 and 4
that $\mathrm{var}(\varepsilon) = \sigma^2 I$ by the following two assumptions that are conditional on $X$:

$E[\varepsilon|X] = 0, \quad \mathrm{var}(\varepsilon|X) = \sigma^2 I$.

Then it holds true that

$E[b|X] = \beta, \quad \mathrm{var}(b|X) = \sigma^2(X'X)^{-1}$,

so that the previous results remain true if we interpret everything conditional
on $X$. To prove the above two results, note that

$E[b|X] = E[\beta + (X'X)^{-1}X'\varepsilon|X] = \beta + (X'X)^{-1}X'E[\varepsilon|X] = \beta$,
$\mathrm{var}(b|X) = \mathrm{var}(\beta + (X'X)^{-1}X'\varepsilon|X) = (X'X)^{-1}X'\mathrm{var}(\varepsilon|X)X(X'X)^{-1}$
$= (X'X)^{-1}X'(\sigma^2 I)X(X'X)^{-1} = \sigma^2(X'X)^{-1}$.
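
The two conditional results can be illustrated by a simulation in which one realisation of the regressors is drawn and then held fixed, while the disturbances are redrawn many times. The Python sketch below rests on assumed settings (normal regressors and disturbances, $n = 40$, $\sigma = 1.5$, $\beta = (1, 2)'$, and an arbitrary seed), none of which come from the text.

```python
import numpy as np

rng = np.random.default_rng(2)        # arbitrary seed (assumption)
n, sigma = 40, 1.5                    # hypothetical sample size and disturbance s.d.
beta = np.array([1.0, 2.0])           # hypothetical true coefficients

# One realisation of stochastic regressors, held fixed afterwards (conditioning on X).
X = np.column_stack([np.ones(n), rng.normal(size=n)])
XtX_inv = np.linalg.inv(X.T @ X)

reps = 50_000
eps = sigma * rng.standard_normal((reps, n))   # E[eps|X] = 0, var(eps|X) = sigma^2 I
Y = X @ beta + eps                             # each row is one sample y given X
B = Y @ X @ XtX_inv                            # each row is b' = y'X(X'X)^{-1}

print("mean of b over replications:", B.mean(axis=0))
print("simulated var(b|X):")
print(np.cov(B, rowvar=False))
print("sigma^2 (X'X)^{-1}:")
print(sigma ** 2 * XtX_inv)
```

The simulated mean should be close to $\beta$ and the simulated covariance matrix close to $\sigma^2(X'X)^{-1}$ for the particular $X$ that was drawn; a different draw of $X$ gives a different conditional variance.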


Derivation of statistical properties of OLS when X and ε are independent

Consider the linear model $y = X\beta + \varepsilon$ and suppose that Assumptions 2-6 (see
Section 3.1.4) are satisfied, but that Assumption 1 of fixed regressors is not valid.
If $X$ is random but independently distributed from $\varepsilon$, then it follows that

$E[b] = E[(X'X)^{-1}X'y]$
$= \beta + E[(X'X)^{-1}X'\varepsilon]$
$= \beta + E[(X'X)^{-1}X']E[\varepsilon]$
$= \beta$,

where the third equality follows because $X$ and $\varepsilon$ are independent. So, in this case
the least squares estimator is still unbiased. To evaluate the variance $\mathrm{var}(b) =
E[(b - \beta)(b - \beta)']$ we write

$b = (X'X)^{-1}X'y = (X'X)^{-1}X'(X\beta + \varepsilon) = \beta + (X'X)^{-1}X'\varepsilon$,    (4.2)

so that $b - \beta = (X'X)^{-1}X'\varepsilon$. Using the properties of conditional expectations (see
Section 1.2.2 (p. 24)) it follows by conditioning on $X$ (denoted by $E[\,\cdot\,|X]$) that

$\mathrm{var}(b) = E[(X'X)^{-1}X'\varepsilon\varepsilon'X(X'X)^{-1}]$
$= E[E[(X'X)^{-1}X'\varepsilon\varepsilon'X(X'X)^{-1}|X]]$
$= E[(X'X)^{-1}X'E[\varepsilon\varepsilon'|X]X(X'X)^{-1}]$
$= E[(X'X)^{-1}X'E[\varepsilon\varepsilon']X(X'X)^{-1}]$
$= \sigma^2E[(X'X)^{-1}]$.

The third equality follows because, conditional on $X$, $X$ is given, and the fourth
equality holds true because $X$ and $\varepsilon$ are independent. The last equality uses the fact
that $E[\varepsilon\varepsilon'] = \sigma^2 I$ because of Assumptions 2-4. This shows that the variance of
$b$ depends on the distribution of $X$.

Consequences of random regressors

In general it may be difficult to estimate the joint distribution of $X$
or to estimate $E[(X'X)^{-1}]$. The t- and F-statistics as computed in Chapter 3
will  no  longer  exactly  follow  the  t-  and  F-distributions.  This  also  means  that 
the  P-values  reported  by  statistical  packages  that  are  based  on  these  distribu¬ 
tions  are  no  longer  valid.  In  general,  the  exact  finite  sample  distributions  of  b 
and  of  the  t-  and  F-statistics  cannot  be  determined  analytically.  However,  the 
asymptotic  properties  can  be  determined  under  appropriate  regularity  con¬ 
ditions  (see  Section  4.1.4). 
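The two results derived above (unbiasedness and \(\mathrm{var}(b) = \sigma^2 E[(X'X)^{-1}]\)) can be checked by a small Monte Carlo experiment. The following is a minimal sketch, not part of the original text: it assumes a single regressor without constant term, normally distributed \(x_i\) and \(\varepsilon_i\), and illustrative values \(n = 10\), \(\beta = 2\), \(\sigma = 1\). The function name `simulate_ols` is hypothetical.

```python
import random

def simulate_ols(n=10, beta=2.0, sigma=1.0, runs=20000, seed=1):
    """Monte Carlo for y_i = beta*x_i + eps_i with x_i and eps_i independent."""
    rng = random.Random(seed)
    estimates = []   # least squares estimate b in each run
    inv_xx = []      # (X'X)^{-1} = 1 / sum(x_i^2) in each run
    for _ in range(runs):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        y = [beta * xi + rng.gauss(0.0, sigma) for xi in x]
        sxx = sum(xi * xi for xi in x)
        estimates.append(sum(xi * yi for xi, yi in zip(x, y)) / sxx)
        inv_xx.append(1.0 / sxx)
    mean_b = sum(estimates) / runs
    var_b = sum((b - mean_b) ** 2 for b in estimates) / (runs - 1)
    # Simulation estimate of sigma^2 * E[(X'X)^{-1}]
    var_formula = sigma ** 2 * sum(inv_xx) / runs
    return mean_b, var_b, var_formula

mean_b, var_b, var_formula = simulate_ols()
print(f"mean of b: {mean_b:.3f}")
print(f"simulated var(b): {var_b:.4f}   sigma^2 E[(X'X)^-1]: {var_formula:.4f}")
```

In this design \(X'X\) follows a chi-squared distribution with \(n\) degrees of freedom, so that \(E[(X'X)^{-1}] = 1/(n-2)\). In practice the distribution of \(X\) is unknown and this expectation cannot be computed in such a simple way.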

The  assumption  of  stable  regressors 

In the sequel we no longer assume that \(X\) and \(\varepsilon\) are independent. In order to
investigate the asymptotic properties of the least squares estimator, we make
the following assumption. For the definition and calculation rules of probability
limits we refer to Section 1.3.3 (p. 48-9).

•  Assumption 1*: stability (replaces Assumption 1 of fixed regressors). The
regressors \(X\) may be stochastic and the probability limit of \(\frac{1}{n}X'X\) exists
and is non-singular, that is, for some non-singular \(k \times k\) matrix \(Q\) there
holds

\[ \mathrm{plim}\left(\frac{1}{n}X'X\right) = Q. \]

This stability assumption places restrictions on the variation in the
explanatory variables: the variables should vary sufficiently (so that
\(Q\) is invertible) but not excessively (so that \(Q\) is finite). For example, suppose
that the values of the \(k \times 1\) vector of regressors are obtained by random
sampling from a population with zero mean and positive definite covariance
matrix \(Q\), that is, from a population where the regressors are not perfectly
collinear. The element \((h, j)\) of the matrix \(\frac{1}{n}X'X\) is given by
\(\frac{1}{n}\sum_{i=1}^{n} x_{hi}x_{ji}\), the (non-centred) second moment of the \(h\)th and
\(j\)th explanatory variable. The law of large numbers (see Section 1.3.3 (p. 50))
implies that \(\mathrm{plim}\left(\frac{1}{n}\sum_{i=1}^{n} x_{hi}x_{ji}\right) = E[x_{hi}x_{ji}] = Q_{hj}\),
so that Assumption 1* holds true under these conditions.
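The random sampling example can be made concrete with a short simulation. This is a sketch under assumptions that are not in the text: \(k = 2\) regressors drawn from a bivariate normal population with unit variances and covariance 0.5, so that \(Q\) has ones on the diagonal and 0.5 off the diagonal. The function name `second_moment_matrix` is hypothetical.

```python
import math
import random

def second_moment_matrix(n, rho=0.5, seed=0):
    """Return (1/n) X'X for n draws of two zero-mean regressors with covariance rho."""
    rng = random.Random(seed)
    m11 = m12 = m22 = 0.0
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        x1 = z1
        x2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * z2  # var(x2) = 1, cov(x1, x2) = rho
        m11 += x1 * x1
        m12 += x1 * x2
        m22 += x2 * x2
    return [[m11 / n, m12 / n], [m12 / n, m22 / n]]

for n in (25, 400, 20000):
    print(n, second_moment_matrix(n))
```

As \(n\) grows, the printed matrices settle down near \(Q\), which is what Assumption 1* requires.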


4.1.3  Consistency 

The  exogeneity  condition  for  consistency 

If \(X\) is random but independent of \(\varepsilon\), then the least squares estimator \(b\) is
unbiased. If \(X\) and \(\varepsilon\) are not independent, then \(b\) is in general no longer
unbiased, because \(E[b] = \beta + E[(X'X)^{-1}X'\varepsilon]\) and the last term is non-zero
in general. To investigate whether \(b\) is consistent, that is, whether
\(\mathrm{plim}(b) = \beta\), we write (4.2) as

\[ b = \beta + \left(\frac{1}{n}X'X\right)^{-1}\frac{1}{n}X'\varepsilon. \qquad (4.3) \]

Using the rules for probability limits, it follows from Assumption 1* that

\[ \mathrm{plim}(b) = \beta + Q^{-1}\,\mathrm{plim}\left(\frac{1}{n}X'\varepsilon\right), \]

so that \(b\) is consistent if and only if

\[ \mathrm{plim}\left(\frac{1}{n}X'\varepsilon\right) = 0. \qquad (4.4) \]

This last condition is called the orthogonality condition. If this condition is
satisfied, then the explanatory variables are said to be exogenous (or sometimes
'weakly' exogenous, to distinguish this type of exogeneity, which is
related to consistent estimation, from other types of exogeneity related to
forecasting and structural breaks). The \(j\)th component of (4.4) can be written
as \(\mathrm{plim}\left(\frac{1}{n}\sum_{i=1}^{n} x_{ji}\varepsilon_i\right) = 0\), so that this condition basically means that the explanatory
variables should be asymptotically uncorrelated with the disturbances.


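Derivation of consistency of \(s^2\)

□  Under Assumption 1* and condition (4.4), \(s^2\) (defined in (3.22)) is a consistent
estimator of \(\sigma^2\) provided that \(\mathrm{plim}\left(\frac{1}{n}\varepsilon'\varepsilon\right) = \mathrm{plim}\left(\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i^2\right) = \sigma^2\). This can
be seen by writing (using the notation and results of Section 3.1.5)

\[
\begin{aligned}
s^2 &= \frac{1}{n-k}\,e'e = \frac{1}{n-k}\,\varepsilon'M\varepsilon = \frac{1}{n-k}\left(\varepsilon'\varepsilon - \varepsilon'X(X'X)^{-1}X'\varepsilon\right) \\
&= \frac{n}{n-k}\left(\frac{1}{n}\varepsilon'\varepsilon - \frac{1}{n}\varepsilon'X\left(\frac{1}{n}X'X\right)^{-1}\frac{1}{n}X'\varepsilon\right).
\end{aligned}
\]

For \(n \to \infty\) the first expression in the last line converges to 1, the second to \(\sigma^2\), the
third and fifth to zero because of condition (4.4), and the fourth expression
converges to \(Q^{-1}\) because of Assumption 1*. This shows that \(\mathrm{plim}(s^2) = \sigma^2\)
under the stated conditions.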


An  example  where  OLS  is  consistent 

As  an  illustration,  we  consider  the  data  generating  process 


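\[ y_i = \beta x_i + \varepsilon_i, \]

where the \(x_i\) are IID\((0, q)\) and the \(\varepsilon_i\) are IID\((0, \sigma^2)\). If the explanatory variable
\(x_i\) and the disturbance term \(\varepsilon_i\) are independent, it follows that

\[ E\left[\frac{1}{n}X'\varepsilon\right] = E\left[\frac{1}{n}\sum_{i=1}^{n} x_i\varepsilon_i\right] = \frac{1}{n}\sum_{i=1}^{n} E[x_i]E[\varepsilon_i] = 0, \]

\[ \mathrm{var}\left(\frac{1}{n}X'\varepsilon\right) = E\left[\left(\frac{1}{n}\sum_{i=1}^{n} x_i\varepsilon_i\right)^2\right] = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} E[x_ix_j]E[\varepsilon_i\varepsilon_j] = \frac{1}{n^2}\sum_{i=1}^{n} E[x_i^2]E[\varepsilon_i^2] = \frac{q\sigma^2}{n}. \]

It follows from the result (1.48) in Section 1.3.3 (p. 49) that in this case
condition (4.4) is satisfied.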
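The mean and variance of \(\frac{1}{n}X'\varepsilon\) in this example can be verified numerically. The sketch below is only an illustration under assumed values that do not come from the text: normal distributions with \(q = 4\) and \(\sigma^2 = 1\). The function name `moment_stats` is hypothetical.

```python
import random

def moment_stats(n, q=4.0, sigma=1.0, runs=4000, seed=2):
    """Mean and variance, over `runs` samples, of (1/n) * sum(x_i * eps_i)."""
    rng = random.Random(seed)
    sd_x = q ** 0.5
    draws = []
    for _ in range(runs):
        total = sum(rng.gauss(0.0, sd_x) * rng.gauss(0.0, sigma) for _ in range(n))
        draws.append(total / n)
    mean = sum(draws) / runs
    var = sum((d - mean) ** 2 for d in draws) / (runs - 1)
    return mean, var

results = {n: moment_stats(n) for n in (25, 100)}
for n, (mean, var) in results.items():
    # The simulated variance should be close to q * sigma^2 / n
    print(f"n={n}: mean={mean:.4f}, variance={var:.4f}, q*sigma^2/n={4.0 / n:.4f}")
```

The variance falls in proportion to \(1/n\), so \(\frac{1}{n}X'\varepsilon\) concentrates around its mean of zero as the sample grows.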


An  example  where  OLS  is  not  consistent 

On the other hand, if \(x_i\) and \(\varepsilon_i\) are correlated then the least squares estimator
is no longer consistent. This is illustrated by a simulation in Exhibit 4.2. Here
the explanatory variable and the disturbance terms have positive covariance
(see (a)), so that \(\gamma = E[x_i\varepsilon_i] > 0\), and the estimated slope \(b\) is larger than
the slope \(\beta\) of the DGP (see (b)). This is in line with the fact that
\(\mathrm{plim}(b) = \beta + q^{-1}\,\mathrm{plim}\left(\frac{1}{n}X'\varepsilon\right) = \beta + \gamma/q > \beta\). Note that the least squares
estimate is obtained from the normal equation
\(\frac{1}{n}\sum_i x_i(y_i - bx_i) = \frac{1}{n}\sum_i x_ie_i = 0\),
where \(e_i = y_i - bx_i\) are the least squares residuals. This means that the
positive correlation between \(x_i\) and \(\varepsilon_i\) cannot be detected from the least
squares residuals (see (c)). Therefore, in practice this issue cannot be tested
by simply looking at the residuals. Tests for exogeneity will be discussed later
(see Section 5.7.3 (p. 411)).

Exhibit 4.2 Inconsistency

Effect of correlation between regressor and disturbance terms. The data are generated by
\(y = x + \varepsilon\) (so that the DGP has slope parameter \(\beta = 1\)). (a) shows the scatter diagram of
the disturbance terms \(\varepsilon\) (EPS) against the regressor \(x\), which are positively correlated.
(b) contains the scatter diagram of \(y\) against \(x\) with the regression line and the systematic
relation \(y = x\) (dashed line) of the DGP. This shows that least squares overestimates the
slope parameter. (c) contains the scatter diagram of the least squares residuals (RES) against
\(x\), which shows that the correlation between \(x\) and the disturbances (\(\varepsilon\)) cannot be detected
in this way.
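The following sketch reproduces the logic of this example numerically. It is an illustration under assumptions, not the simulation behind Exhibit 4.2: \((x_i, \varepsilon_i)\) are drawn from a bivariate normal distribution with unit variances and covariance \(\gamma = 0.5\) (so \(q = 1\)), and \(\beta = 1\). The function name `ols_with_endogeneity` is hypothetical.

```python
import math
import random

def ols_with_endogeneity(n=20000, beta=1.0, gamma=0.5, seed=3):
    """Fit y = b*x by least squares when cov(x, eps) = gamma and var(x) = var(eps) = 1."""
    rng = random.Random(seed)
    x, eps = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        x.append(z1)
        eps.append(gamma * z1 + math.sqrt(1.0 - gamma ** 2) * z2)
    y = [beta * xi + ei for xi, ei in zip(x, eps)]
    b = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
    resid = [yi - b * xi for xi, yi in zip(x, y)]
    mean_x_eps = sum(xi * ei for xi, ei in zip(x, eps)) / n    # near gamma
    mean_x_res = sum(xi * ri for xi, ri in zip(x, resid)) / n  # zero by the normal equation
    return b, mean_x_eps, mean_x_res

b, mean_x_eps, mean_x_res = ols_with_endogeneity()
print(f"b = {b:.3f} (beta + gamma/q = 1.5)")
print(f"(1/n) sum x*eps = {mean_x_eps:.3f}, (1/n) sum x*e = {mean_x_res:.2e}")
```

The disturbances are correlated with \(x\), yet the residuals are orthogonal to \(x\) by construction, which is why the problem cannot be seen in a residual plot.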

Exercises: T: 4.1, 4.3, 4.4; S: 4.8.


4.1.4  Asymptotic  normality 

Derivation  of  asymptotic  distribution 

To determine the asymptotic distribution of \(b\), it is helpful to rewrite (4.3) as

\[ \sqrt{n}(b - \beta) = \left(\frac{1}{n}X'X\right)^{-1}\frac{1}{\sqrt{n}}X'\varepsilon. \qquad (4.5) \]

Under Assumption 1*, the first factor in (4.5) converges in probability to \(Q^{-1}\), so
that it remains to determine the asymptotic distribution of \(\frac{1}{\sqrt{n}}X'\varepsilon\). It can be shown
that, under Assumptions 1* and 2-6 and some additional weak regularity conditions,
there holds

\[ \frac{1}{\sqrt{n}}X'\varepsilon \xrightarrow{d} N(0, \sigma^2 Q). \qquad (4.6) \]

The result in (4.6) is based on generalizations of the central limit theorem. We do
not discuss the precise regularity conditions needed for this general result, but we
analyse the simple regression model \(y_i = \beta x_i + \varepsilon_i\) in somewhat more detail.

□  Illustration: Simple regression model

Suppose that the disturbances \(\varepsilon_i\) are independently but not normally distributed and
that the (single) explanatory variable \(x_i\) is non-stochastic. In this case
\(\frac{1}{\sqrt{n}}X'\varepsilon = \frac{1}{\sqrt{n}}\sum_{i=1}^{n} x_i\varepsilon_i = \frac{1}{\sqrt{n}}\sum_{i=1}^{n} z_i\), where the random variables \(z_i = x_i\varepsilon_i\) are
independently distributed with mean \(E[z_i] = E[x_i\varepsilon_i] = 0\) and variance
\(E[z_i^2] = E[x_i^2\varepsilon_i^2] = \sigma^2x_i^2\). In particular, if \(x_i = 1\) (so that the model contains only the constant
term), then \(\frac{1}{\sqrt{n}}X'\varepsilon = \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\varepsilon_i\) and, according to the central limit theorem (1.50) in
Section 1.3.3 (p. 50), it follows that this converges in distribution to \(N(0, \sigma^2)\). As
\(Q = 1\) in this situation this shows (4.6) for this particular case. If \(x_i\) is not constant,
we can use a generalized central limit theorem (see Section 1.3.3), which states
that \(\frac{1}{\sqrt{n}}\sum_{i=1}^{n} z_i\) converges in distribution to \(N(0, \bar{\sigma}^2)\) with variance equal to
\(\bar{\sigma}^2 = \lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}\mathrm{var}(z_i) = \sigma^2\lim\left(\frac{1}{n}\sum_{i=1}^{n} x_i^2\right) = \sigma^2Q\), which proves (4.6) also
for this case. Note that asymptotic normality is obtained independent of the
distribution of the disturbances, that is, even if the disturbances are not normally
distributed. The result in (4.6) can be proved under much weaker conditions,
with correlated disturbances and stochastic \(X\), but the orthogonality
condition is crucial to obtain the zero mean in (4.6).

Asymptotic  distribution  of  OLS  estimator 

If the result on the asymptotic distribution in (4.6) holds true, it follows from
(4.5) and Assumption 1* that

\[ \sqrt{n}(b - \beta) \xrightarrow{d} N(0, \sigma^2Q^{-1}QQ^{-1}) = N(0, \sigma^2Q^{-1}). \qquad (4.7) \]

Approximate  distribution  in  finite  samples 

We say that the rate of convergence of \(b\) to \(\beta\) is \(\sqrt{n}\). If the sample size \(n\) is large
enough, the finite sample distribution of \(b\) can be approximated by
\(N\left(\beta, \frac{\sigma^2}{n}Q^{-1}\right)\). It depends on the application to hand which size of the sample
is required to justify this approximation. For instance, for the case of random
samples discussed in Section 1.3.3, the distribution of the sample mean is
often well approximated by a normal distribution for small sample sizes like
\(n = 50\). On the other hand, if the model for example contains many regressors,
then larger sample sizes may be required. The situation is somewhat
comparable to the discussion in Section 3.3.3 on multicollinearity (p. 158).
The expression (3.47) for the variance shows that the sample size required to
get a prescribed precision depends on the amount of variation in the individual
regressors and on the correlations between the regressors.

Practical  use  of  asymptotic  distribution 

To apply the normal approximation in practice, the (unknown) matrix \(Q\) is
approximated by \(\frac{1}{n}X'X\). This gives the approximate distribution

\[ b \approx N(\beta, \sigma^2(X'X)^{-1}). \qquad (4.8) \]

This means that the statistical results of Chapter 3 (for example, the t-test
and the F-test that are based on the assumption that \(b \sim N(\beta, \sigma^2(X'X)^{-1})\))
remain valid as an asymptotic approximation under the following four
assumptions.

(i)  \(\mathrm{plim}\left(\frac{1}{n}X'X\right) = Q\) exists and is invertible (Assumption 1*),

(ii)  \(E[\varepsilon] = 0\), \(\mathrm{var}(\varepsilon) = \sigma^2I\) (Assumptions 2-4),

(iii)  \(y = X\beta + \varepsilon\) (Assumptions 5 and 6),

(iv)  \(\mathrm{plim}\left(\frac{1}{n}X'\varepsilon\right) = 0\) (orthogonality condition).

The standard inference methods for least squares are still valid for stochastic
regressors and non-normal disturbances, provided that these four conditions
are satisfied.
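As an illustration of this claim, the sketch below checks how often a nominal 5 per cent t-test rejects a true null hypothesis when the regressor is stochastic and the disturbances are not normal. All design choices are assumptions made for illustration: a single regressor without constant term, \(x_i\) standard normal, \(\varepsilon_i\) uniform with variance 1 and independent of \(x_i\), \(n = 100\), and the critical value 1.96 of the normal approximation. The function name `rejection_rate` is hypothetical.

```python
import math
import random

def rejection_rate(n=100, beta=1.0, runs=5000, seed=4):
    """Share of runs in which |t| > 1.96 for the true hypothesis that the slope equals beta."""
    rng = random.Random(seed)
    half_width = math.sqrt(3.0)  # uniform on [-sqrt(3), sqrt(3)] has variance 1
    rejections = 0
    for _ in range(runs):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        y = [beta * xi + rng.uniform(-half_width, half_width) for xi in x]
        sxx = sum(xi * xi for xi in x)
        b = sum(xi * yi for xi, yi in zip(x, y)) / sxx
        s2 = sum((yi - b * xi) ** 2 for xi, yi in zip(x, y)) / (n - 1)  # k = 1 parameter
        t = (b - beta) / math.sqrt(s2 / sxx)  # standard error from s^2 (X'X)^{-1}
        if abs(t) > 1.96:
            rejections += 1
    return rejections / runs

rate = rejection_rate()
print(f"rejection rate at nominal 5% level: {rate:.3f}")
```

A rejection rate close to 0.05 indicates that the usual t-test works well as an asymptotic approximation in this design, where the four conditions hold.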


4.1.5  Simulation  examples 

As  an  illustration,  we  perform  some  simulation  experiments  with  the  model 

\[ y_i = x_i + \varepsilon_i, \qquad i = 1, \ldots, n. \]

So our data generating process has parameters \(\beta = 1\) and \(\sigma^2 = 1\).

Exhibit 4.3 Simulation Example (Section 4.1.5)

Consistency and asymptotic normality. Estimates of the slope parameter (\(b\), denoted by B in
(a) and (b)), a normalized version (BNORM, i.e. \(\sqrt{n}(b - 1)\) in (c) and (d)), and estimates of the
disturbance variance (\(s^2\), denoted by S2 in (e) and (f)), for simulated data that satisfy the
orthogonality condition. The number of simulation runs is 10,000, and the histograms show
the distribution of the resulting 10,000 estimates. The sample size is \(n = 25\) in (a), (c), and (e)
and \(n = 100\) in (b), (d), and (f) (note the differences between the scales on the horizontal axis
for both sample sizes).

Exhibit 4.3 (Contd.)

Inconsistency when orthogonality is violated. Estimates of the slope parameter (\(b\), denoted by
B in (g) and (h)), a normalized version (BNORM, i.e. \(\sqrt{n}(b - 1)\) in (i) and (j)), and estimates
of the disturbance variance (\(s^2\), denoted by S2 in (k) and (l)), for simulated data that do not
satisfy the orthogonality condition. The number of simulation runs is 10,000, and the histograms
show the distribution of the resulting 10,000 estimates. The sample size is \(n = 25\) in (g),
(i), and (k) and \(n = 100\) in (h), (j), and (l) (note the differences between the scales on the
horizontal axis for both sample sizes).

Simulations  with  stable  random  regressors 

First we consider simulations where the values of \((x_i, \varepsilon_i)\) are obtained
by a random sample of the bivariate normal distribution with mean zero,
unit variances, and covariance \(\rho\). So the regressor \(x_i\) is random, and it is also
stable because the law of large numbers implies that
\(\mathrm{plim}\left(\frac{1}{n}\sum_{i=1}^{n} x_i^2\right) = E[x^2] = 1\).

We consider two experiments, one experiment with \(\rho = 0\) (so that the
regressor satisfies the orthogonality condition) and another experiment
with \(\rho = 0.5\) (so that the orthogonality condition is violated). Exhibit 4.3
shows histograms (based on 10,000 simulations) of the values of
\(b\), \(\sqrt{n}(b - 1)\), and \(s^2\), for sample sizes \(n = 25\) (a, c, e, g, i, k) and \(n = 100\)
(b, d, f, h, j, l). The histograms (a-f) indicate the consistency and
approximate normality when the orthogonality condition is satisfied, and the
histograms (g-l) indicate the inconsistency of both \(b\) and \(s^2\) when the orthogonality
condition is violated. (Note that the horizontal axis differs among
the different histograms, so that the width of the distributions is more easily
compared by comparing the reported standard deviations of the outcomes.)

Exhibit 4.4 Simulation Example (Section 4.1.5)

Estimates of the slope parameter (\(b\), denoted by B in (a)-(d)) and a normalized version
(BNORM, i.e. \(\sqrt{n}(b - 1)\) in (e)-(h)) for two data generating processes that do not satisfy
Assumption 1*. (a), (c), (e), (g), and (i) are for the model with linear trend and (b), (d), (f), (h),
and (j) are for the model with hyperbolic trend. (a)-(b) show the estimates of \(b\) for sample size
\(n = 25\) and (c)-(d) for \(n = 100\); (e)-(f) show the outcomes of BNORM for \(n = 25\) and (g)-(h)
for \(n = 100\); (i)-(j) show scatter diagrams for a sample of size \(n = 100\) of the models with
linear trend (i) and with hyperbolic trend (j).

Simulations  with  regressors  that  are  not  stable 

Next we generate data from the model \(y_i = x_i + \varepsilon_i\) with \(x_i = i\) (a linear trend)
and with \(x_i = 1/i\) (a hyperbolic trend). In both cases the disturbances \(\varepsilon_i\) are
NID(0, 1), and the least squares estimator \(b\) is unbiased and efficient. Note
that these trend models do not satisfy the stability Assumption 1*, as
\(\lim\left(\frac{1}{n}\sum_{i=1}^{n} i^2\right) = \infty\) and \(\lim\left(\frac{1}{n}\sum_{i=1}^{n} 1/i^2\right) = 0\). In the linear trend model, the
rate of convergence of \(b\) to \(\beta\) is equal to \(n\sqrt{n}\) (instead of \(\sqrt{n}\)), and in the
hyperbolic trend model the estimator \(b\) does not converge to \(\beta\) (the proof is
left as an exercise (see Exercise 4.2)).
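Because the regressors in both trend models are non-stochastic and the disturbances have unit variance, the variance of \(b\) equals \(1/\sum_i x_i^2\), so the rates mentioned above can be computed directly without simulation. The sketch below does this; the helper name `sd_b` is hypothetical.

```python
import math

def sd_b(x):
    """Exact standard deviation of b = sum(x*y)/sum(x*x) when var(eps_i) = 1."""
    return 1.0 / math.sqrt(sum(xi * xi for xi in x))

for n in (25, 100, 10000):
    linear = [float(i) for i in range(1, n + 1)]
    hyperbolic = [1.0 / i for i in range(1, n + 1)]
    print(
        f"n={n}: linear trend, n*sqrt(n)*sd(b) = {n * math.sqrt(n) * sd_b(linear):.4f}; "
        f"hyperbolic trend, sd(b) = {sd_b(hyperbolic):.4f}"
    )
```

For the linear trend, \(n\sqrt{n}\) times the standard deviation of \(b\) stabilizes (at \(\sqrt{3}\)), in line with the stated rate of convergence. For the hyperbolic trend, the standard deviation of \(b\) itself stays near \(\sqrt{6}/\pi\) and does not shrink, so \(b\) cannot converge to \(\beta\).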

Exhibit 4.4 shows the histograms of \(b\) and \(\sqrt{n}(b - 1)\) for 10,000
simulations of both models, with sample sizes \(n = 25\) (a, b, e, f) and
\(n = 100\) (c, d, g, h). By comparing the reported standard deviations in the
histograms, it is seen that for the linear trend the distribution of \(\sqrt{n}(b - 1)\)
shrinks to zero for \(n \to \infty\) (see (e) and (g)), whereas for the hyperbolic trend
the distribution of \(\sqrt{n}(b - 1)\) does not converge for \(n \to \infty\) (see (f) and (h)).
For the hyperbolic trend data the least squares estimator \(b\) is not consistent
(see (b) and (d)), as the observations \(x_i = 1/i\) of the explanatory variable do not
contain sufficient variation for \(i \to \infty\).

4.2  Non-linear regression


4.2.1  Motivation 

Assumptions  on  the  data  generating  process 

In  this  section  we  consider  regression  models  that  are  non-linear  in  the 
parameters.  This  means  that  Assumption  6  in  Section  3.1.4  (p.  125)  no 
longer  holds  true.  Throughout  this  section  we  suppose  that  the  stability 
Assumption  1*  of  Section  4.1.2  and  the  Assumptions  2-5  of  Section  3.1.4 
are  satisfied  and  that  the  regressors  satisfy  the  orthogonality  condition  (4.4) 
in Section 4.1.3. We now present two examples motivating the use of non-linear
models.

Example  4.2:  Coffee  Sales  (continued) 

In  this  example  we  consider  marketing  data  on  coffee  sales.  These  data  are 
obtained  from  a  controlled  marketing  experiment  in  stores  in  suburban  Paris 
(see  A.  C.  Bemmaor  and  D.  Mouchoux,  ‘Measuring  the  Short-Term  Effect  of 
In-Store  Promotion  and  Retail  Advertising  on  Brand  Sales:  A  Factorial 
Experiment’,  Journal  of  Marketing  Research,  28  (1991),  202-14).  The 
question of interest is whether the sensitivity of consumers to price reductions
depends on the magnitude of the price reduction. Stated in economic
terms,  the  question  is  whether  the  price  elasticity  of  demand  for  coffee  is 
constant  or  whether  it  depends  on  the  price.  We  will  discuss  (i)  the  data, 
(ii)  the  linear  model  with  constant  elasticity,  and  (iii)  a  non-linear  model  with 
varying  elasticity. 

(i)  Data 

Exhibit 4.5 shows scatter diagrams of weekly sales (\(q\)) of two brands of coffee
against the applied deal rate (\(d\)) in these weeks (both variables are taken in
natural logarithms). The data for brand 2 (in (b)) were discussed before in
Example 2.3 (p. 78). The deal rate \(d\) is defined as \(d = 1\) if no price reduction
applies, \(d = 1.05\) if the price reduction is 5 per cent, and \(d = 1.15\) if the price
reduction is 15 per cent. For each brand there are \(n = 12\) observations, six
with \(d = 1\), three with \(d = 1.05\), and three with \(d = 1.15\). For both brands,
two of the sales figures for \(d = 1.15\) are nearly overlapping (the lower figure
for brand 1 in (a) and the higher figure for brand 2 in (b)).

Exhibit 4.5 Coffee Sales (Example 4.2)

Scatter diagrams for two brands of coffee, brand 1 (a) and brand 2 (b). The variable on the
vertical axis is the logarithm of sales (in units of coffee), the variable on the horizontal axis is
the logarithm of the deal rate (deal rates of 1.05 and 1.15 correspond to price reductions of
5% and 15% respectively). Both scatter diagrams contain twelve points, but for both brands
two observations for deal rate 15% are nearly overlapping (for brand 1 the ones with the
lower sales and for brand 2 the ones with the higher sales).


(ii)  Linear  model  with  constant  elasticity 

A  simple  linear  regression  model  is  given  by 

\[ \log(q) = \beta_1 + \beta_2\log(d) + \varepsilon \]

(here we suppress the observation index \(i\) for ease of notation). In this model
\(\beta_2\) is the derivative of \(\log(q)\) with respect to \(\log(d)\), that is,

\[ \beta_2 = \frac{\partial\log(q)}{\partial\log(d)} = \frac{\partial q/q}{\partial d/d}, \]

which  is  the  demand  elasticity  with  respect  to  the  deal  rate.  So  the  slope  in 
the  scatter  diagram  of  log  (q)  against  log  (d)  corresponds  to  the  demand 
elasticity. 

(iii)  Non-linear  model  with  varying  elasticity 

The  scatter  diagrams  in  Exhibit  4.5  suggest  that  for  both  brands  the  elasticity 
may  not  be  constant.  The  slope  seems  to  decrease  for  larger  values  of  log  (d), 
so  that  the  elasticity  may  be  decreasing  for  higher  deal  rates.  A  possible  way 
to  model  such  a  rate-specific  elasticity  is  given  by  the  equation 


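\[ \log(q) = \beta_1 + \frac{\beta_2}{\beta_3}\left(d^{\beta_3} - 1\right) + \varepsilon. \qquad (4.9) \]

As \((d^{\beta_3} - 1)/\beta_3 \to \log(d)\) for \(\beta_3 \to 0\), the limiting model for \(\beta_3 = 0\) is the
linear model \(\log(q) = \beta_1 + \beta_2\log(d) + \varepsilon\). The deal rate elasticity in (4.9) is
equal to

\[ \frac{\partial\log q}{\partial\log d} = \frac{\partial\log q}{\partial d/d} = d\,\frac{\partial\log q}{\partial d} = d\,\frac{\beta_2}{\beta_3}\,\beta_3 d^{\beta_3 - 1} = \beta_2 d^{\beta_3}. \]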

The null hypothesis of constant elasticity corresponds to \(\beta_3 = 0\), that is, the
linear model. The non-linear model (4.9) provides a simple way to model a
non-constant elasticity. This example will be further analysed in Sections
4.2.5 and 4.3.9.
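The elasticity formula and the limiting case of model (4.9) can be checked numerically. In the sketch below the parameter values (\(\beta_1 = 5\), \(\beta_2 = 10\), \(\beta_3 = -5\)) are purely hypothetical and are not estimates from the coffee data; the function names are also illustrative.

```python
import math

def transformed_deal_rate(d, beta3):
    """(d**beta3 - 1) / beta3, with the limiting value log(d) at beta3 = 0."""
    if abs(beta3) < 1e-12:
        return math.log(d)
    return (d ** beta3 - 1.0) / beta3

def log_sales(d, beta1, beta2, beta3):
    """Systematic part of model (4.9)."""
    return beta1 + beta2 * transformed_deal_rate(d, beta3)

def elasticity(d, beta2, beta3):
    """Deal rate elasticity beta2 * d**beta3 implied by (4.9)."""
    return beta2 * d ** beta3

beta1, beta2, beta3 = 5.0, 10.0, -5.0  # hypothetical values, for illustration only
for d in (1.0, 1.05, 1.15):
    h = 1e-6
    # Numerical version of d * dlog(q)/dd, to compare with the formula
    numeric = d * (log_sales(d + h, beta1, beta2, beta3)
                   - log_sales(d - h, beta1, beta2, beta3)) / (2 * h)
    print(f"d={d}: formula {elasticity(d, beta2, beta3):.4f}, numerical {numeric:.4f}")
```

With a negative \(\beta_3\) the elasticity decreases for higher deal rates, which is the pattern suggested by the scatter diagrams. With \(\beta_3 = 0\) the elasticity is constant and equal to \(\beta_2\).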


Example 4.3: Food Expenditure

As  a  second  example  we  consider  budget  data  on  food  expenditure  of  groups 
of  households.  Here  the  question  of  interest  is  whether  the  food  expenditure 
depends  linearly  on  household  income  or  whether  this  dependence  becomes 
weaker  for  higher  levels  of  income.  Such  a  decreasing  effect  of  income  on 
food  consumption  may  be  expected  because  households  with  higher  incomes 
can  afford  to  spend  relatively  more  on  other  expenses  that  provide  a  higher 
marginal  utility  than  additional  food. 

Exhibit 4.6 shows a scatter diagram of the fraction of consumptive expenditure
of households spent on food against total consumptive expenditure
(measured in $10,000). These data are analysed (amongst others) in a special
issue of the Journal of Applied Econometrics (12/5 (1997)). The data consist
of averages over groups of households and were obtained by a budget survey
in the USA in 1950. The total consumption expenditure is taken as a
measure of the household income. We denote the fraction of expenditure
spent on food by \(y\), the total consumption expenditure (in $10,000) by
\(x_2\), and the (average) household size by \(x_3\). The scatter diagram indicates
that the effect of income on the fraction spent on food declines for
higher income levels. Such a relation can be expressed by the non-linear
model

\[ y = \beta_1 + \beta_2 x_2^{\beta_3} + \beta_4 x_3 + \varepsilon. \]

The hypothesis that the fraction spent on food does not depend on household
income corresponds to \(\beta_3 = 0\), and the hypothesis that it depends linearly on
income corresponds to \(\beta_3 = 1\). Further analysis of this example is left as an
exercise (see Exercise 4.16).

Exhibit 4.6 Food Expenditure (Example 4.3)

Scatter diagram of fifty-four data points of the fraction of expenditure spent on food against
total (consumption) expenditure.


4.2.2  Non-linear  least  squares 

Uses Appendix A.7.


Non-linear  regression 

The linear regression model \(y = X\beta + \varepsilon\) can be written as \(y_i = x_i'\beta + \varepsilon_i\), where
\(i\) (\(i = 1, \ldots, n\)) denotes the observation and where \(x_i'\) is the \(i\)th row of the
\(n \times k\) matrix \(X\) (so that \(x_i\) is a \(k \times 1\) vector). This model is linear in the
unknown parameters \(\beta\). A non-linear regression model is described by an
equation of the form

\[ y_i = f(x_i, \beta) + \varepsilon_i, \qquad (4.10) \]

where \(f\) is a non-linear function. If the non-linearity is only in \(x_i\), that is, if
for fixed \(x_i\) the function \(f\) is linear in \(\beta\), then this can be written as
\(f(x_i, \beta) = \beta_1f_1(x_i) + \cdots + \beta_kf_k(x_i)\). This is a linear regression model with
explanatory variables \(f_j(x_i)\), \(j = 1, \ldots, k\). In this case the parameters can be
estimated by regressing \(y\) on the explanatory variables \(f_1, \ldots, f_k\). On the
other hand, if the function is non-linear in \(\beta\) (for instance, as in (4.9)),
then the least squares estimation problem to minimize

\[ S(\beta) = \sum_{i=1}^{n}\left(y_i - f(x_i, \beta)\right)^2 \qquad (4.11) \]

becomes non-linear. The first order conditions are given by

\[ \frac{\partial S(\beta)}{\partial\beta} = -2\sum_{i=1}^{n}\left(y_i - f(x_i, \beta)\right)\frac{\partial f(x_i, \beta)}{\partial\beta} = 0. \]

This gives a set of non-linear normal equations in \(\beta\). In general the solution of
these equations cannot be determined analytically, so that numerical approximations
are needed. Numerical aspects are discussed in the next
section.
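To make the minimization of (4.11) and the non-linear normal equations concrete, the sketch below fits the model \(f(x, \beta) = \beta_1 + \beta_2x^{\beta_3}\) (the food expenditure model without the household size term) to simulated data. Everything in it is an assumption made for illustration: the data are artificial, and the minimization exploits the fact that for fixed \(\beta_3\) the model is linear in \((\beta_1, \beta_2)\), combined with a grid search and golden-section refinement over \(\beta_3\). This is one simple numerical approach, not the method used in the text.

```python
import math
import random

def ols_line(z, y):
    """Intercept and slope of the least squares line of y on z."""
    n = len(z)
    zbar, ybar = sum(z) / n, sum(y) / n
    szz = sum((zi - zbar) ** 2 for zi in z)
    szy = sum((zi - zbar) * (yi - ybar) for zi, yi in zip(z, y))
    slope = szy / szz
    return ybar - slope * zbar, slope

def fit_given_power(x, y, b3):
    """For fixed b3 the model is linear in (b1, b2); return (SSR, b1, b2)."""
    z = [xi ** b3 for xi in x]
    b1, b2 = ols_line(z, y)
    ssr = sum((yi - b1 - b2 * zi) ** 2 for zi, yi in zip(z, y))
    return ssr, b1, b2

def nls(x, y, step=0.05, grid_points=40, tol=1e-9):
    """Minimize S(beta): coarse grid over b3, then golden-section refinement."""
    profile = lambda b3: fit_given_power(x, y, b3)[0]
    best = min((step * k for k in range(1, grid_points + 1)), key=profile)
    lo, hi = max(best - step, 1e-3), best + step
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    c, d = hi - ratio * (hi - lo), lo + ratio * (hi - lo)
    fc, fd = profile(c), profile(d)
    while hi - lo > tol:
        if fc < fd:   # minimum lies in [lo, d]
            hi, d, fd = d, c, fc
            c = hi - ratio * (hi - lo)
            fc = profile(c)
        else:         # minimum lies in [c, hi]
            lo, c, fc = c, d, fd
            d = lo + ratio * (hi - lo)
            fd = profile(d)
    b3 = (lo + hi) / 2.0
    _, b1, b2 = fit_given_power(x, y, b3)
    return b1, b2, b3

def gradient(x, y, beta):
    """dS/dbeta = -2 * sum (y_i - f_i) * df_i/dbeta for f = b1 + b2 * x**b3."""
    b1, b2, b3 = beta
    g = [0.0, 0.0, 0.0]
    for xi, yi in zip(x, y):
        e = yi - b1 - b2 * xi ** b3
        for j, dfj in enumerate((1.0, xi ** b3, b2 * xi ** b3 * math.log(xi))):
            g[j] += -2.0 * e * dfj
    return g

rng = random.Random(5)
x = [0.5 * i for i in range(1, 21)]                              # artificial regressor values
y = [2.0 + 3.0 * xi ** 0.5 + rng.gauss(0.0, 0.1) for xi in x]    # artificial DGP
b_nls = nls(x, y)
print("NLS estimate:", [round(b, 3) for b in b_nls])
print("gradient of S at the estimate:", gradient(x, y, b_nls))
```

At the reported estimate all three components of the gradient are numerically close to zero, that is, the non-linear normal equations are satisfied.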


Requirement  of  identified  parameters 

The non-linear least squares (NLS) estimator \(b_{NLS}\) is defined as the minimizing
value of (4.11). We assume that this minimum exists and that it is unique.
This imposes conditions on the model. For example, if there exist parameter
vectors \(\beta_1 \neq \beta_2\) with \(f(x_i, \beta_1) = f(x_i, \beta_2)\) for all \(x_i\), then \(S(\beta_1) = S(\beta_2)\) in
(4.11), in which case minima need not be unique. The parameters of the
model (4.10) are said to be identified if for all \(\beta_1 \neq \beta_2\) there exists a vector \(x\)
such that \(f(x, \beta_1) \neq f(x, \beta_2)\). The parameters of the linear model with
\(f(x, \beta) = x'\beta\) are always identified provided that the explanatory variables
\(x\) are not perfectly collinear. So, if Assumption 1 is satisfied so that the
regressor matrix \(X\) has rank \(k\), then the parameters \(\beta\) of the linear model
are identified. An example of a non-linear regression model with unidentified
parameters (with a single explanatory variable \(x\)) is \(f(x, \beta) = \beta_1e^{\beta_2 + \beta_3x}\), as
two parameter vectors \((\beta_{11}, \beta_{21}, \beta_{31})\) and \((\beta_{12}, \beta_{22}, \beta_{32})\) give the same function
values for all values of \(x\) if \(\beta_{31} = \beta_{32}\) and \(\beta_{11}e^{\beta_{21}} = \beta_{12}e^{\beta_{22}}\). To avoid
problems in optimization one should work only with models with identified
parameters.
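The lack of identification in this example is easy to verify numerically. The sketch below uses two arbitrarily chosen (hypothetical) parameter vectors that satisfy the two equalities stated above.

```python
import math

def f(x, beta):
    """The model function beta1 * exp(beta2 + beta3 * x) from the example."""
    b1, b2, b3 = beta
    return b1 * math.exp(b2 + b3 * x)

beta_a = (2.0, 0.0, 0.3)
beta_b = (1.0, math.log(2.0), 0.3)  # same beta3, and 1 * e^{log 2} = 2 * e^0

for x in (-1.0, 0.0, 2.5, 10.0):
    print(x, f(x, beta_a), f(x, beta_b))
```

Both parameter vectors produce the same function values for every \(x\), so the sum of squares (4.11) cannot distinguish between them, whatever the data.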

Statistical  properties  of  non-linear  least  squares 

The estimator \(b_{NLS}\) will in general not be unbiased. Under appropriate
assumptions it is a consistent estimator and its variance may be approximated
in large samples by

\[ \mathrm{var}(b_{NLS}) \approx s^2(X'X)^{-1}, \]

where \(s^2 = \frac{1}{n-k}\sum_{i=1}^{n}\left(y_i - f(x_i, b_{NLS})\right)^2\) is the NLS estimate of the variance of
the disturbance terms \(\varepsilon_i\). Here \(X\) is the \(n \times k\) matrix of first order derivatives
of the function \(f\) in (4.10) with respect to \(\beta\), that is,

\[ X = \begin{pmatrix} \partial f(x_1, \beta)/\partial\beta' \\ \vdots \\ \partial f(x_n, \beta)/\partial\beta' \end{pmatrix}. \qquad (4.12) \]

Note that for the linear model with \(f(x_i, \beta) = x_i'\beta\) this gives the matrix \(X\) as
defined in (3.2) in Section 3.1.2 (p. 120).
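The matrix in (4.12) can be built numerically when analytical derivatives are inconvenient. The sketch below approximates the derivatives by central differences, which is an assumption of this illustration rather than something prescribed by the text. It is applied to a linear model and to the coffee sales model (4.9), with hypothetical parameter values; all function names are illustrative.

```python
def derivative_matrix(f, xs, beta, h=1e-6):
    """n x k matrix with rows df(x_i, beta)/dbeta', by central differences."""
    rows = []
    for x in xs:
        row = []
        for j in range(len(beta)):
            up, down = list(beta), list(beta)
            up[j] += h
            down[j] -= h
            row.append((f(x, up) - f(x, down)) / (2.0 * h))
        rows.append(row)
    return rows

def linear(x, beta):
    """Linear model f(x, beta) = x'beta, with x a tuple of regressor values."""
    return sum(xj * bj for xj, bj in zip(x, beta))

def coffee(d, beta):
    """Systematic part of model (4.9) for deal rate d."""
    b1, b2, b3 = beta
    return b1 + b2 / b3 * (d ** b3 - 1.0)

# Linear model: the rows of X are just the regressor vectors.
X_lin = derivative_matrix(linear, [(1.0, 2.0), (1.0, 5.0)], [0.3, -0.7])
# Non-linear model: X depends on the parameter value at which it is evaluated.
X_nl = derivative_matrix(coffee, [1.0, 1.05, 1.15], [5.0, 10.0, -5.0])
print(X_lin)
print(X_nl)
```

For the linear model the result reproduces the regressor matrix and does not depend on \(\beta\). For the non-linear model the columns change with \(\beta\), which is why \(X\) has to be evaluated at a specific parameter value.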


Idea  of  conditions  for  asymptotic  properties 

It is beyond the scope of this book to derive the above asymptotic results for \(b_{NLS}\).
However, we give an idea of the required assumptions, which are basically the
same as the ones discussed in Section 4.1.

Suppose that the data are generated by (4.10) with parameter vector \(\beta = \beta_0\).
Further suppose that the disturbance terms satisfy Assumptions 2-4, and
that Assumption 5 (constant parameters) is also satisfied. Suppose further that
Assumption 1* is satisfied with \(X\) as defined in (4.12) and evaluated at \(\beta = \beta_0\). Let
\(f_i^0 = f(x_i, \beta_0)\) and \(f_i = f(x_i, \beta)\), then the least squares criterion (4.11) can be
decomposed as follows:

\[
\begin{aligned}
\frac{1}{n}S(\beta) &= \frac{1}{n}\sum_{i=1}^{n}(y_i - f_i)^2 = \frac{1}{n}\sum_{i=1}^{n}(f_i^0 + \varepsilon_i - f_i)^2 \\
&= \frac{1}{n}\sum_{i=1}^{n}(f_i^0 - f_i)^2 + \frac{1}{n}\sum_{i=1}^{n}\varepsilon_i^2 + \frac{2}{n}\sum_{i=1}^{n}(f_i^0 - f_i)\varepsilon_i.
\end{aligned}
\]

Of the three terms in the last expression, the middle one does not depend on \(\beta\) and
hence it does not affect the location of the minimum of \(S(\beta)\). For \(n \to \infty\), the last
term will tend (in probability) to zero under appropriate orthogonality conditions.
For instance, in the linear model with \(f_i = x_i'\beta\), we get \(f_i^0 - f_i = x_i'(\beta_0 - \beta)\) and the
condition \(\mathrm{plim}\left(\frac{1}{n}\sum_i x_i\varepsilon_i\right) = 0\) is the orthogonality condition (4.4). Finally, the
first term \(\frac{1}{n}\sum_i (f_i^0 - f_i)^2\) will not vanish for \(\beta \neq \beta_0\) if the parameters are identified
in the sense that for every \(\beta \neq \beta_0\)

\[ \mathrm{plim}\left(\frac{1}{n}\sum_{i=1}^{n}\left(f(x_i, \beta_0) - f(x_i, \beta)\right)^2\right) \neq 0. \]

Under the above conditions, the minimum value of \(\frac{1}{n}S(\beta)\) is asymptotically only
obtained for \(\beta = \beta_0\), and hence \(b_{NLS}\) is consistent. Under similar conditions \(b_{NLS}\) is
also asymptotically normally distributed in the sense that

\[ \sqrt{n}(b_{NLS} - \beta_0) \xrightarrow{d} N(0, \sigma^2Q^{-1}), \qquad (4.13) \]

where \(Q = \mathrm{plim}\left(\frac{1}{n}X'X\right)\) with \(X\) the \(n \times k\) matrix of first order derivatives defined
in (4.12) and evaluated at \(\beta_0\).


Approximate  distribution  in  finite  samples 

Under the foregoing conditions, the result in (4.13) means that

\[ b_{NLS} \approx N(\beta_0, s^2(X'X)^{-1}), \qquad (4.14) \]

where \(X\) is the matrix defined in (4.12) and evaluated at \(b_{NLS}\) and where
\(s^2 = \frac{1}{n-k}\sum_{i=1}^{n}\left(y_i - f(x_i, b_{NLS})\right)^2\) is the NLS estimate of \(\sigma^2\). Under similar
conditions \(b_{NLS}\) is also asymptotically efficient, in the sense that
\(\sqrt{n}(b_{NLS} - \beta_0)\) has the smallest covariance matrix among all consistent estimators
of \(\beta_0\).

The result in (4.14) motivates the use of t-tests and F-tests in a similar way
as in Chapter 3. For the F-test the sums of squares are equal to the minimum
value of \(S(\beta)\) in (4.11) under the null hypothesis and under the alternative
hypothesis. That is, let \(b_{NLS}\) be the unrestricted non-linear least squares
estimator and \(b_{NLS}^R\) the restricted estimator obtained by imposing \(g\) restrictions
under the null hypothesis. Then under the above assumptions the F-test
is computed by

\[ F = \frac{(e_R'e_R - e'e)/g}{e'e/(n - k)} \approx F(g, n - k), \]

where \(e'e = S(b_{NLS})\) is the sum of squares (4.11) obtained for the unrestricted
NLS estimate \(b_{NLS}\) and \(e_R'e_R = S(b_{NLS}^R)\) is the sum of squares obtained for
\(b_{NLS}^R\).
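Once the restricted and unrestricted sums of squares are available, the F-statistic is a one-line computation. The sketch below uses made-up numbers (they are not results from the examples in this chapter) for a model with \(n = 12\) observations, \(k = 3\) parameters, and \(g = 1\) restriction; the function name `f_statistic` is hypothetical.

```python
def f_statistic(ssr_restricted, ssr_unrestricted, g, n, k):
    """F = ((e_R'e_R - e'e) / g) / (e'e / (n - k))."""
    return ((ssr_restricted - ssr_unrestricted) / g) / (ssr_unrestricted / (n - k))

# Hypothetical sums of squares, for illustration only
F = f_statistic(ssr_restricted=0.5, ssr_unrestricted=0.4, g=1, n=12, k=3)
print(f"F = {F:.4f}")
```

The outcome is then compared with a critical value of the \(F(g, n - k)\) distribution, which in this setting holds only as an approximation.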


Summary  of  computations  in  NLS 

The non-linear least squares estimate \(b_{NLS}\) is obtained by minimizing the
sum of squares (4.11), for instance, by one of the non-linear optimization
algorithms discussed in the next section. Under suitable regularity conditions,
and provided that the parameters of the model are identified, the
estimator \(b_{NLS}\) is consistent and asymptotically normally distributed.
Asymptotic t-values and F-tests can be obtained as in the linear regression
model, using the fact that in large enough samples \(\mathrm{var}(b_{NLS}) \approx s^2(X'X)^{-1}\)
where \(s^2 = \frac{1}{n-k}\sum_{i=1}^{n} e_i^2\) (with \(e_i = y_i - f(x_i, b_{NLS})\) the NLS residuals) and where
the \(n \times k\) regressor matrix \(X\) is given in (4.12), evaluated at \(\beta = b_{NLS}\).
Summarizing,

Computations for NLS

•  Step 1: Estimation. Estimate \(\beta\) by minimizing (4.11) and determine the NLS
residuals \(e_i = y_i - f(x_i, b_{NLS})\).

•  Step 2: Testing. Approximate t- and F-tests can be based on the fact that
\(b_{NLS} \approx N(\beta, s^2(X'X)^{-1})\) where \(s^2 = \frac{1}{n-k}\sum_{i=1}^{n} e_i^2\) and \(X\) is given in (4.12).

4.2.3  Non-linear optimization

Uses Appendix A.7.


Numerical  aspects 

If the model (4.10) is non-linear, the objective function $S(\beta)$ in (4.11) is not
quadratic and the optimal value of $\beta$ cannot be written as an explicit
expression in terms of the data $(y_i, x_i)$, $i = 1, \ldots, n$. In this section we
consider some numerical aspects of non-linear optimization. The vector of
unknown parameters is denoted by $\theta$ and the objective function by $F(\theta)$, with
column vector of gradients $G(\theta) = \partial F(\theta)/\partial\theta$ and Hessian matrix
$H(\theta) = \partial^2 F(\theta)/\partial\theta\,\partial\theta'$. Optimal values of $\theta$ are characterized by the first
order conditions

$$G(\theta) = 0.$$


Numerical  procedures  often  involve  the  following  steps. 


Iterative  optimization 

•  Step 1: Start. Determine an initial estimate of $\theta$, say $\hat\theta_0$.

•  Step 2: Improve and repeat. Determine an improved estimate of $\theta$, say $\hat\theta_1$.
Iterate these improvements, giving a sequence of estimates $\hat\theta_1, \hat\theta_2, \hat\theta_3, \ldots$.

•  Step 3: Stop. Stop the iterations if the improvements become sufficiently
small.
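
The three steps above can be sketched as a generic loop. This is only an illustration under stated assumptions: the function names `iterate`, `update`, and `objective`, the tolerance, and the toy objective are hypothetical and not part of the text, and the stopping rule (small relative changes in both the parameters and the objective) is one possible choice.

```python
def iterate(update, objective, theta0, tol=1e-8, max_iter=200):
    """Step 1: start at theta0. Step 2: improve and repeat. Step 3: stop."""
    theta = list(theta0)
    f_old = objective(theta)
    for h in range(1, max_iter + 1):
        new = update(theta)
        f_new = objective(new)
        # Relative change in each parameter and in the objective value.
        d_par = max(abs(a - b) / max(abs(b), 1e-12) for a, b in zip(new, theta))
        d_obj = abs(f_new - f_old) / max(abs(f_old), 1e-12)
        theta, f_old = new, f_new
        if d_par < tol and d_obj < tol:
            return theta, h, True      # converged after h iterations
    return theta, max_iter, False      # stopped without convergence


# Hypothetical objective F(theta) = (theta - 3)^2 + 1, improved by a
# fixed-step gradient update (just one possible choice for step 2).
def objective(theta):
    return (theta[0] - 3.0) ** 2 + 1.0


def update(theta):
    gradient = 2.0 * (theta[0] - 3.0)
    return [theta[0] - 0.25 * gradient]


theta_hat, n_iter, converged = iterate(update, objective, [0.0])
print(theta_hat, n_iter, converged)
```

Restarting `iterate` from several different values of `theta0` and comparing the results is a simple way to look for local optima.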


Remarks  on  numerical  methods 

In general there is no guarantee that the final estimate $\hat\theta$ is close to the global
optimum. Even if $G(\hat\theta) \approx 0$, this may correspond to a local optimum. To
reduce the risk that the calculated $\hat\theta$ is only a local optimum instead of a global
optimum, one can vary the initial estimate of $\theta$ in step 1. For instance, we can
change each component of the final estimate $\hat\theta$ by a certain percentage and
take the new values as initial estimates in a new round of iterations. For
the stopping rule in step 3, one can consider the percentage changes in the
estimated parameters $\hat\theta_h$ and $\hat\theta_{h+1}$ in two consecutive iterations and the relative
improvement $(F(\hat\theta_{h+1}) - F(\hat\theta_h))/F(\hat\theta_h)$. If these changes are small enough, the
iterations are stopped. If the improvements in the objective function are small
but the changes in the parameters remain large in a sequence of iterations, this
may be an indication of identification problems. A possible solution is to
adjust the objective function or the underlying model specification.

Several methods are available for the iterations in step 2. Here we discuss
two methods that are often applied, namely Newton-Raphson and Gauss-Newton.
Both methods are based on the idea of linear approximation:
of the gradient $G(\theta)$ in Newton-Raphson and of the non-linear
function $f(x, \beta)$ in Gauss-Newton.

The  Newton-Raphson  method 

The Newton-Raphson method is based on the iterative linearization of the
first order condition for an optimum, that is, $G(\theta) = 0$. Around a given
value $\hat\theta_h$, the gradient $G$ can be linearized by $G(\theta) \approx G(\hat\theta_h) + H(\hat\theta_h)(\theta - \hat\theta_h)$.
The condition $G(\theta) = 0$ is approximated by the condition
$G(\hat\theta_h) + H(\hat\theta_h)(\theta - \hat\theta_h) = 0$. These equations are linear in the unknown parameter
vector $\theta$ and they are easily solved, giving the next estimate

$$\hat\theta_{h+1} = \hat\theta_h - H_h^{-1}G_h, \tag{4.15}$$

where $G_h$ and $H_h$ are the gradient and Hessian matrix evaluated at $\hat\theta_h$. Under
certain regularity conditions these iterations converge to a local optimum of
$F(\theta)$. It depends on the form of the function $F(\theta)$ and on the procedure to
determine initial estimates $\hat\theta_0$ whether the limiting estimate corresponds to
the global optimum. A graphical illustration of this method is given in
Exhibit 4.7, which shows the (non-linear) gradient function and two iterations
of the algorithm.
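
A minimal scalar sketch of the iterations (4.15), in the spirit of Exhibit 4.7. The objective $F(\theta) = e^{\theta} - 2\theta$, the starting value, and the function name `newton_raphson` are assumptions chosen for illustration; for this objective the gradient is $e^{\theta} - 2$ and the optimum is $\log 2$.

```python
import math


def newton_raphson(grad, hess, theta0, tol=1e-12, max_iter=50):
    """Scalar version of (4.15): theta_{h+1} = theta_h - G_h / H_h."""
    theta = theta0
    path = [theta]
    for _ in range(max_iter):
        step = grad(theta) / hess(theta)
        theta = theta - step
        path.append(theta)
        if abs(step) < tol:          # stop when the adjustment is negligible
            break
    return theta, path


# Hypothetical objective F(theta) = exp(theta) - 2*theta, minimized at log(2).
def grad(t):
    return math.exp(t) - 2.0         # G(theta)


def hess(t):
    return math.exp(t)               # H(theta)


theta_hat, path = newton_raphson(grad, hess, theta0=0.0)
print(path[:4], theta_hat)
```

The first entries of `path` play the role of the starting value and the first two iterates in Exhibit 4.7.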

Exhibit 4.7  Newton-Raphson

Illustration of two Newton-Raphson iterations to find the optimum of an objective function.
The graph shows the first derivative ($G$) of the objective function as a function of the parameter
$\theta$. The algorithm starts in $\hat\theta_0$; $\hat\theta_1$ and $\hat\theta_2$ denote the estimates obtained in the first and second
iteration, and $\hat\theta$ is the optimal value.

Regularization

Sometimes, for instance if the Hessian matrix is nearly singular, the iterations
in (4.15) are adjusted by a regularization factor so that

$$\hat\theta_{h+1} = \hat\theta_h - (H_h + cI)^{-1}G_h,$$

where $c > 0$ is a chosen constant and $I$ is the identity matrix. This forces the
parameter adjustments more in the direction of the gradient. The Newton-Raphson
method requires the computation of the gradient vector and the Hessian
matrix in each iteration. In some cases these can be computed analytically; in
other cases one has to use numerical methods.
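
The effect of the regularization constant can be illustrated with a scalar sketch. The objective $F(\theta) = \sqrt{1 + \theta^2}$, the starting value 2, and the choice $c = 1$ are assumptions made for this example only. For this objective the Hessian becomes very small for large $|\theta|$, and the unregularized step maps $\theta$ to $-\theta^3$, so the plain iterations diverge from any start with $|\theta| > 1$.

```python
def newton_step(theta, c=0.0):
    """One scalar step theta - G/(H + c) for F(theta) = sqrt(1 + theta^2)."""
    g = theta / (1.0 + theta ** 2) ** 0.5    # gradient G(theta)
    h = (1.0 + theta ** 2) ** -1.5           # Hessian H(theta), tiny for large |theta|
    return theta - g / (h + c)


def run(theta0, c, n_iter):
    """Apply n_iter (regularized) Newton-Raphson steps from theta0."""
    theta = theta0
    for _ in range(n_iter):
        theta = newton_step(theta, c)
    return theta


plain = run(2.0, c=0.0, n_iter=5)            # c = 0: iterations (4.15), diverging here
regularized = run(2.0, c=1.0, n_iter=100)    # c = 1: hypothetical regularization constant
print(plain, regularized)
```

With $c = 1$ the iterates decrease steadily towards the optimum at zero, at the price of slower (linear rather than quadratic) convergence close to the optimum.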


The  Gauss-Newton  method 

In many cases the computation of the Hessian matrix is cumbersome and it is
much more convenient to use methods that require only the gradient. Therefore
we now discuss the Gauss-Newton method for non-linear regression
models. In this case the parameter vector is $\theta = \beta$ and the objective function
is $S(\beta)$ defined in (4.11). The idea is to linearize the function $f$ so that this
objective function becomes quadratic.


Derivation  of  Gauss-Newton  iterations 

Assuming that the function $f(x, \beta)$ is differentiable around a given value $\hat\beta_h$, it can
be written as

$$f(x, \beta) = f_h(x) + g_h(x)'(\beta - \hat\beta_h) + r_h(x),$$

where $f_h(x) = f(x, \hat\beta_h)$ and $g_h(x) = \partial f(x, \beta)/\partial\beta$ is the gradient, the $k \times 1$ vector of
first order derivatives, evaluated at $\hat\beta_h$. Further $r_h(x)$ is a remainder term that
becomes negligible if $\beta$ is close to $\hat\beta_h$. If we replace the function $f(x, \beta)$ in (4.11) by
its linear approximation, the least squares problem becomes to minimize

$$S_h(\beta) = \sum_{i=1}^{n}\left(y_i - f_h(x_i) - g_h(x_i)'(\beta - \hat\beta_h)\right)^2 = \sum_{i=1}^{n}\left(z_{hi} - g_{hi}'\beta\right)^2,$$

where $z_{hi} = y_i - f_h(x_i) + g_h(x_i)'\hat\beta_h$ and $g_{hi} = g_h(x_i)$ are computed at the given value
of $\hat\beta_h$. The minimization of $S_h(\beta)$ with respect to $\beta$ is an ordinary least squares
problem with dependent variable $z_{hi}$ and with independent variables $g_{hi}$. Let $z_h$ be
the $n \times 1$ vector with elements $z_{hi}$ and let $X_h$ be the $n \times k$ matrix with rows
$g_{hi}' = \partial f(x_i, \beta)/\partial\beta'$, that is, the matrix (4.12) evaluated at $\hat\beta_h$. Further let

$$e_{hi} = y_i - f(x_i, \hat\beta_h)$$

be the residuals of the non-linear regression model (4.10) corresponding to $\hat\beta_h$. The
value of $\beta$ that minimizes $S_h(\beta)$ is obtained by regressing $z_h$ on $X_h$, and using
the fact that $z_h = e_h + X_h\hat\beta_h$ it follows that
$\hat\beta_{h+1} = (X_h'X_h)^{-1}X_h'z_h = (X_h'X_h)^{-1}X_h'(e_h + X_h\hat\beta_h)$ and hence

$$\hat\beta_{h+1} = \hat\beta_h + (X_h'X_h)^{-1}X_h'e_h. \tag{4.16}$$

So, in each Gauss-Newton iteration the parameter adjustment $\hat\beta_{h+1} - \hat\beta_h$ is
obtained by regressing the residuals $e_h$ of the last estimated model on the gradient
matrix $X_h$ evaluated at $\hat\beta_h$. The Gauss-Newton iterations are repeated until the
estimates converge. The usual expression for the variance of least squares estimators
in the final iteration is $\sigma^2(X'X)^{-1}$, where $X$ is the gradient matrix evaluated at
the final estimate $\hat\beta$. This is precisely the asymptotic approximation of the variance
of the non-linear least squares estimator $b_{NLS}$ in (4.14). So asymptotic standard
errors of $\hat\beta$ are immediately obtained from the final regression in (4.16).
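
The iterations (4.16) can be sketched for a small two-parameter model. Everything specific here is an assumption for illustration: the model $f(x, \beta) = \beta_1 e^{\beta_2 x}$, the five data points (generated from $\beta = (2, 0.5)$ with small fixed disturbances), the starting values, and the function names. With $k = 2$ the regression of $e_h$ on $X_h$ can be written out using the explicit inverse of the $2 \times 2$ matrix $X_h'X_h$.

```python
import math

# Hypothetical data: y = 2*exp(0.5*x) plus small fixed "noise" terms.
x = [0.0, 0.5, 1.0, 1.5, 2.0]
noise = [0.03, -0.02, 0.01, -0.03, 0.02]
y = [2.0 * math.exp(0.5 * xi) + ui for xi, ui in zip(x, noise)]


def residuals_and_gradients(b):
    """Residuals e_h and rows of X_h for f(x, b) = b1 * exp(b2 * x)."""
    e = [yi - b[0] * math.exp(b[1] * xi) for xi, yi in zip(x, y)]
    X = [(math.exp(b[1] * xi), b[0] * xi * math.exp(b[1] * xi)) for xi in x]
    return e, X


def cross_products(e, X):
    """Elements of X'X (2 x 2, symmetric) and X'e (2 x 1)."""
    a11 = sum(g1 * g1 for g1, _ in X)
    a12 = sum(g1 * g2 for g1, g2 in X)
    a22 = sum(g2 * g2 for _, g2 in X)
    c1 = sum(g1 * ei for (g1, _), ei in zip(X, e))
    c2 = sum(g2 * ei for (_, g2), ei in zip(X, e))
    return a11, a12, a22, c1, c2


b = [1.5, 0.4]                                # starting values (an assumption)
for iteration in range(100):
    e, X = residuals_and_gradients(b)
    a11, a12, a22, c1, c2 = cross_products(e, X)
    det = a11 * a22 - a12 * a12
    # Regress e_h on X_h: delta = (X'X)^{-1} X'e, then update as in (4.16).
    d1 = (a22 * c1 - a12 * c2) / det
    d2 = (a11 * c2 - a12 * c1) / det
    b = [b[0] + d1, b[1] + d2]
    if max(abs(d1), abs(d2)) < 1e-10:
        break

# Asymptotic standard errors from the final regression: s^2 (X'X)^{-1}.
e, X = residuals_and_gradients(b)
a11, a12, a22, c1, c2 = cross_products(e, X)
det = a11 * a22 - a12 * a12
s2 = sum(ei * ei for ei in e) / (len(x) - 2)
se = [math.sqrt(s2 * a22 / det), math.sqrt(s2 * a11 / det)]
print(b, se)
```

At convergence the cross products `c1` and `c2` (the elements of $X'e$) are numerically zero, which is the first order condition of the NLS problem.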


Comparison  of  the  two  methods 

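Finally, we compare the Gauss-Newton iterations (4.16) with those of Newton-Raphson
in (4.15) for the least squares criterion $S(\beta)$ in (4.11). For the criterion
function $F(\theta) = S(\beta)$, the gradient and Hessian at $\hat\beta_h$ are given by

$$\frac{\partial S}{\partial\beta} = -2\sum_{i=1}^{n}\left(y_i - f(x_i, \beta)\right)\frac{\partial f(x_i, \beta)}{\partial\beta} = -2X_h'e_h,$$

$$\frac{\partial^2 S}{\partial\beta\,\partial\beta'} = 2\sum_{i=1}^{n}\frac{\partial f(x_i, \beta)}{\partial\beta}\frac{\partial f(x_i, \beta)}{\partial\beta'} - 2\sum_{i=1}^{n}\left(y_i - f(x_i, \beta)\right)\frac{\partial^2 f(x_i, \beta)}{\partial\beta\,\partial\beta'} = 2X_h'X_h - 2\sum_{i=1}^{n} e_{hi}\frac{\partial^2 f(x_i, \beta)}{\partial\beta\,\partial\beta'}.$$

So the Newton-Raphson iterations (4.15) reduce to those of Gauss-Newton
(4.16) if we neglect the last term in the above expression for the Hessian. This
can also be motivated asymptotically, as $\frac{1}{n}X'X$ has a finite and non-zero limit
(under Assumption 1*) and the term $\frac{1}{n}\sum_{i=1}^{n} e_{hi}\,\partial^2 f(x_i, \beta)/\partial\beta\,\partial\beta'$ converges to zero for $n \to \infty$
(under appropriate orthogonality conditions).


Exercises: S: 4.9; E: 4.13b, 4.16b, c.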


4.2.4  The  Lagrange  Multiplier  test 

Uses Appendix A.7, A.8.

For the computation of the F-test at the end of Section 4.2.2 we have to
perform two non-linear optimizations, one in the restricted model and
another one in the unrestricted model. We now discuss an alternative approach
for testing parameter restrictions that needs the estimates only of the
restricted model. This test is based on the method of Lagrange for minimization
under restrictions.

Interpretation of the Lagrange multiplier in the linear model

For simplicity we first consider the case of a linear model with linear restrictions,
so that

$$y = X_1\beta_1 + X_2\beta_2 + \varepsilon, \qquad H_0: \beta_2 = 0, \tag{4.17}$$

where $\beta_2$ contains $g$ parameters and $\beta_1$ contains the remaining $k - g$ parameters.
We assume that the restricted model contains a constant term, so that
$X_1$ contains a column with all elements equal to 1. The Lagrange method
states that the least squares estimates under the null hypothesis are obtained
by minimization of the (unconstrained) Lagrange function

$$L(\beta_1, \beta_2, \lambda) = S(\beta_1, \beta_2) + 2\lambda'\beta_2, \tag{4.18}$$

where $S(\beta_1, \beta_2) = (y - X_1\beta_1 - X_2\beta_2)'(y - X_1\beta_1 - X_2\beta_2)$ is the least squares
criterion function and $\lambda$ is a vector with the $g$ Lagrange multipliers. The first
order conditions for a minimum are given by

$$\frac{\partial L}{\partial\beta_1} = -2X_1'(y - X_1b_1 - X_2b_2) = 0,$$

$$\frac{\partial L}{\partial\beta_2} = -2X_2'(y - X_1b_1 - X_2b_2) + 2\lambda = 0,$$

$$\frac{\partial L}{\partial\lambda} = 2b_2 = 0.$$

Substituting $b_2 = 0$ in the first condition shows that $X_1'(y - X_1b_1) = 0$,
that is, $b_1 = b_R = (X_1'X_1)^{-1}X_1'y$ is the restricted least squares estimate
obtained by regressing $y$ on $X_1$. If we write $e_R = y - X_1b_R$ for the corresponding
restricted least squares residuals, then the above three first order
conditions can be written as

$$X_1'e_R = 0, \qquad \lambda = X_2'e_R, \qquad b_2 = 0. \tag{4.19}$$

In particular, $\partial L/\partial\beta_2 = \partial S/\partial\beta_2 + 2\lambda = 0$, so that (evaluated at the restricted
estimates)

$$\lambda = -\frac{1}{2}\frac{\partial S(b_1, 0)}{\partial\beta_2}.$$

So $\lambda$ measures the marginal decrease of the least squares criterion $S$ in (4.11),
which can be achieved by relaxing the restriction that $\beta_2 = 0$. This is illustrated
graphically in Exhibit 4.8. In (a) the slope $\lambda$ is nearly zero (and the
value of $S$ at $\beta_2 = 0$ is nearly minimal), whereas in (b) the slope $\lambda$ is further
away from zero (and the value of $S$ at $\beta_2 = 0$ is further away from the
minimum).

Exhibit 4.8  Lagrange multiplier

Graphical interpretation of the Lagrange multiplier in constrained optimization. The
graphs show the objective function $S(b_1, \beta_2)$ as a function of the parameter $\beta_2$. In (a) the restriction
$\beta_2 = 0$ is close to the unrestricted minimizing value ($b_2$), whereas in (b) $\beta_2 = 0$ is further
away from $b_2$.

The hypothesis $\beta_2 = 0$ is acceptable if the sum of squares $S$ does not
increase much by imposing this restriction, that is, if $\lambda$ is sufficiently
small. This suggests that the null hypothesis can be tested by testing whether
$\lambda$ differs significantly from zero. For this purpose we need to know the
distribution of $\lambda$ under the null hypothesis that $\beta_2 = 0$.


Derivation of LM-test statistic

Under the null hypothesis that $\beta_2 = 0$, it follows that

$$e_R = y - X_1b_R = M_1y = M_1(X_1\beta_1 + \varepsilon) = M_1\varepsilon,$$

where $M_1 = I - X_1(X_1'X_1)^{-1}X_1'$. Under the standard Assumptions 1-7 of Section
3.1.4 (p. 125-6), there holds $\varepsilon \sim N(0, \sigma^2 I)$ so that $e_R \sim N(0, \sigma^2 M_1)$ and

$$\lambda = X_2'e_R \sim N(0, \sigma^2 X_2'M_1X_2).$$

This means that $\lambda'(X_2'M_1X_2)^{-1}\lambda/\sigma^2$ is distributed as $\chi^2(g)$. If the unknown variance
$\sigma^2$ is replaced by the consistent estimator $\hat\sigma^2 = \frac{1}{n}e_R'e_R$, then it follows that

$$LM = \lambda'(X_2'M_1X_2)^{-1}\lambda/\hat\sigma^2 \approx \chi^2(g). \tag{4.20}$$

This is called the Lagrange Multiplier test statistic. The null hypothesis is rejected
for large values of LM, as the value of $\lambda$ then differs significantly from zero. Of
course we could also use the unbiased estimator $s^2 = e_R'e_R/(n - k + g)$ instead of
$\hat\sigma^2$, but here we use $\hat\sigma^2$ for ease of later comparisons. The difference between $s^2$ and
$\hat\sigma^2$ is small if $n$ is sufficiently large and it disappears for $n \to \infty$.


Computation  of  LM-test  by  auxiliary  regressions 

The expression for the LM-test in (4.20) involves the inverse of the matrix
$X_2'M_1X_2$. It is convenient to compute the LM-test in an alternative way by
means of regressions. We will show that the value of the LM-test in (4.20)
can be computed by the following steps.


Computation of LM-test

•  Step 1: Estimate the restricted model. Estimate the restricted model under
the null hypothesis that $\beta_2 = 0$, that is, regress $y$ on $X_1$ alone, with result
$y = X_1b_R + e_R$, where $e_R$ is the vector of residuals of this regression.

•  Step 2: Auxiliary regression of residuals of step 1. Regress the residuals $e_R$
of step 1 on the set of all explanatory variables of the unrestricted model,
that is, regress $e_R$ on $X = (X_1 \; X_2)$.

•  Step 3: LM $= nR^2$ of step 2. Then LM $= nR^2$ of the regression in step 2,
and LM $\approx \chi^2(g)$ if the null hypothesis $\beta_2 = 0$ holds true (where $g$ is the
number of elements of $\beta_2$, that is, the number of restrictions under the
null hypothesis).
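
The three steps can be sketched in code for the linear model. The data below are hypothetical, and the helper `ols_residuals` (least squares via the normal equations, solved by Gauss-Jordan elimination) is written only for this illustration; any regression routine could take its place.

```python
def ols_residuals(columns, y):
    """Residuals of the regression of y on the given regressor columns."""
    k, n = len(columns), len(y)
    # Augmented normal equations [X'X | X'y].
    A = [[sum(columns[i][t] * columns[j][t] for t in range(n)) for j in range(k)]
         + [sum(columns[i][t] * y[t] for t in range(n))] for i in range(k)]
    for i in range(k):
        p = max(range(i, k), key=lambda r: abs(A[r][i]))   # partial pivoting
        A[i], A[p] = A[p], A[i]
        pivot = A[i][i]
        A[i] = [v / pivot for v in A[i]]
        for r in range(k):
            if r != i:
                factor = A[r][i]
                A[r] = [vr - factor * vi for vr, vi in zip(A[r], A[i])]
    coef = [A[i][k] for i in range(k)]
    return [y[t] - sum(coef[j] * columns[j][t] for j in range(k)) for t in range(n)]


def ssr(v):
    """Sum of squares of the elements of v."""
    return sum(t * t for t in v)


# Hypothetical data (n = 8): X1 = (constant, x1), and X2 = x2 is tested (g = 1).
const = [1.0] * 8
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 7.3, 8.0, 8.7]
n = len(y)

# Step 1: restricted regression of y on X1.
e_r = ols_residuals([const, x1], y)
# Step 2: auxiliary regression of e_R on X = (X1 X2).
u = ols_residuals([const, x1, x2], e_r)
# Step 3: LM = n * R^2, where SST = e_R'e_R because e_R has mean zero.
r2 = 1.0 - ssr(u) / ssr(e_r)
lm = n * r2
print(lm)
```

In the linear model the residuals of step 2 coincide with the residuals of the unrestricted regression of $y$ on $X$, so the same value is obtained from the restricted and unrestricted sums of squared residuals.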


Derivation  of  auxiliary  regressions 

The proof of the validity of the above three-step computation of the LM-test is
based on results obtained in Chapter 3. We proceed as follows. It follows from
(4.19) and $\hat\sigma^2 = e_R'e_R/n$ that (4.20) can be written as

$$LM = n\,\frac{e_R'X_2(X_2'M_1X_2)^{-1}X_2'e_R}{e_R'e_R}.$$

To prove that LM $= nR^2$ of step 3, it suffices to prove that the regression in step 2
has total sum of squares SST $= e_R'e_R$ and explained sum of squares
SSE $= e_R'X_2(X_2'M_1X_2)^{-1}X_2'e_R$, as by definition $R^2 =$ SSE/SST.

First we consider the total sum of squares SST of the regression in step 2.
By assumption, the restricted model contains a constant term, and as $X_1'e_R = 0$
it follows that the mean of the restricted residuals $e_R$ is zero. Therefore the total
sum of squares of the regression in step 2 is equal to SST $= \sum (e_{Ri} - \bar{e}_R)^2 = \sum e_{Ri}^2 = e_R'e_R$.
Next we consider the explained sum of squares of the regression in
step 2. The regression of $e_R$ on $X$ in the model $e_R = X\gamma + \omega$ gives $\hat\gamma = (X'X)^{-1}X'e_R$
with explained part $\hat{e}_R = X\hat\gamma = X(X'X)^{-1}X'e_R$. As $X$ contains a constant term, it
follows that the mean of $\hat{e}_R$ is zero. So the explained sum of squares is

$$\mathrm{SSE} = \hat{e}_R'\hat{e}_R = e_R'X(X'X)^{-1}X'e_R.$$

It remains to prove that this can be written as $e_R'X_2(X_2'M_1X_2)^{-1}X_2'e_R$. Now the
conditions in (4.19) also imply that, with $X = (X_1 \; X_2)$, there holds

$$X'e_R = \begin{pmatrix} X_1'e_R \\ X_2'e_R \end{pmatrix} = \begin{pmatrix} 0 \\ X_2'e_R \end{pmatrix}.$$

Further it follows from the results in Section 3.4.1 (p. 161) that the covariance
matrix of $b_2$ (the least squares estimator of $\beta_2$ in the unrestricted model) is equal to
$\mathrm{var}(b_2) = \sigma^2(X_2'M_1X_2)^{-1}$ (see (3.46) (p. 158) for the case where $X_2$ contains a
single column). As the covariance matrix of the unrestricted estimators $(b_1, b_2)$ of
$(\beta_1, \beta_2)$ is equal to $\sigma^2(X'X)^{-1}$, this means that $(X_2'M_1X_2)^{-1}$ is the lower $g \times g$
diagonal block of $(X'X)^{-1}$. Combining these results gives

$$e_R'X(X'X)^{-1}X'e_R = (0 \;\; e_R'X_2)(X'X)^{-1}\begin{pmatrix} 0 \\ X_2'e_R \end{pmatrix} = e_R'X_2(X_2'M_1X_2)^{-1}X_2'e_R.$$

The above results prove the validity of the three-step procedure to compute the
LM-test, so that

$$LM = n\,\frac{e_R'X(X'X)^{-1}X'e_R}{e_R'e_R} = n\,\frac{\mathrm{SSE}}{\mathrm{SST}} = nR^2, \tag{4.21}$$

where $R^2$ is the coefficient of determination of the auxiliary regression

$$e_R = X_1\gamma_1 + X_2\gamma_2 + \omega. \tag{4.22}$$


Interpretation  of  LM-test  and  relation  with  F-test 

The null hypothesis that $\beta_2 = 0$ is rejected for large values of LM, that is,
for large values of $R^2$ in (4.22). Stated intuitively, the restrictions are rejected
if the residuals $e_R$ under the null hypothesis can be explained by the variables
$X_2$. The LM-test in the linear model is related to the F-test (3.50). It is left as
an exercise (see Exercise 4.6) to prove that in the linear model

$$LM = \frac{ngF}{n - k + gF}. \tag{4.23}$$

This shows that for a large sample size $n$ there holds LM $\approx gF$.

Derivation of LM-test in non-linear regression model

Until now we considered the linear regression model (4.17). A similar approach
can be followed to perform tests in non-linear regression models. Consider the
following testing problem, where $\beta_2$ contains $g$ parameters and $\beta_1$ the remaining
$k - g$ parameters in the model

$$y_i = f(x_i, \beta_1, \beta_2) + \varepsilon_i, \qquad H_0: \beta_2 = 0. \tag{4.24}$$

The Lagrange function is defined as in (4.18), with $S$ the non-linear least squares
criterion in (4.11). So $S = \sum (y_i - f(x_i, \beta_1, \beta_2))^2$ and

$$\frac{\partial S}{\partial\beta_1} = -2X_1'e, \qquad \frac{\partial S}{\partial\beta_2} = -2X_2'e,$$

where $X_1 = \partial f/\partial\beta_1'$ is the $n \times (k - g)$ matrix of first order derivatives with respect
to $\beta_1$, $X_2 = \partial f/\partial\beta_2'$ is the $n \times g$ matrix of derivatives with respect to $\beta_2$, and
$e_i = y_i - f(x_i, b_1, b_2)$ are the residuals. As in (4.19), the first order
conditions $\partial L/\partial\beta_1 = 0$, $\partial L/\partial\beta_2 = 0$, and $\partial L/\partial\lambda = 0$ can be written as

$$X_{1R}'e_R = 0, \qquad \lambda = X_{2R}'e_R, \qquad b_2 = 0.$$

Here $X_{1R}$ and $X_{2R}$ are the matrices of derivatives $X_1 = \partial f/\partial\beta_1'$ and $X_2 = \partial f/\partial\beta_2'$
evaluated at $(\beta_1, \beta_2) = (b^R_{NLS}, 0)$, with $b^R_{NLS}$ the restricted NLS estimator of $\beta_1$
under the restriction that $\beta_2 = 0$ and $e_{Ri} = y_i - f(x_i, b^R_{NLS}, 0)$ are the corresponding
residuals. The difference with (4.19) is that $X_{1R}$ and $X_{2R}$ depend on $b^R_{NLS}$, so
that the normal equations $X_{1R}'e_R = 0$ are non-linear in $\beta_1$. As before, the restrictions
that $\beta_2 = 0$ can be tested by considering whether $\lambda$ differs significantly from
zero. Under the conditions of asymptotic normality in (4.14), the test can again be
computed (approximately in large enough samples) as in (4.21).


LM-test in non-linear regression model

The foregoing arguments show that the LM-test of the null hypothesis that
$\beta_2 = 0$ can be computed as

$$LM = nR^2 \approx \chi^2(g),$$

with $R^2$ of the regression of the restricted residuals $e_{Ri} = y_i - f(x_i, b^R_{NLS}, 0)$
on the gradients $X_1 = \partial f/\partial\beta_1'$ and $X_2 = \partial f/\partial\beta_2'$, evaluated at $(b^R_{NLS}, 0)$. In
terms of the Gauss-Newton iterations (4.16), this means that the residuals of
the last iteration (in the model estimated under the null hypothesis) are
regressed on the full matrix of gradients under the alternative hypothesis
and evaluated at $(b^R_{NLS}, 0)$. The LM-test has the advantage that only the
smaller model has to be estimated by NLS, followed by an auxiliary linear
regression as in (4.22).

Summary of computations for the LM-test

The LM-test of the hypothesis that $\beta_2 = 0$ in the model $y = f(x, \beta_1, \beta_2) + \varepsilon$
can be computed by means of an auxiliary regression. Let $\beta_2$ consist of $g$
components and $\beta_1$ of $(k - g)$ components.

Computation of LM-test

•  Step 1: Estimate the restricted model. Estimate the restricted model (with
$\beta_2 = 0$ imposed), with corresponding vector of residuals $e_R$.

•  Step 2: Auxiliary regression of residuals on full set of regressors. Regress the
residuals $e_R$ on the $n \times k$ matrix of first order derivatives
$X = (\partial f/\partial\beta_1' \;\; \partial f/\partial\beta_2')$, evaluated at the restricted estimates.

•  Step 3: LM $= nR^2$ of the regression in step 2. Then LM $= nR^2$ of the
regression in step 2, and the null hypothesis is rejected for large enough
values of the LM-statistic. Asymptotically, the LM-statistic follows the
$\chi^2(g)$ distribution if the hypothesis that $\beta_2 = 0$ holds true.


Exercises:  T:  4.6a;  S:  4.10;  E:  4.13d,  4.16g. 


4.2.5  Illustration:  Coffee  Sales 

We illustrate the results on non-linear regression by considering
the marketing data of coffee sales discussed before in Example 4.2 (p. 202).
We will discuss (i) the model, (ii) the non-linear least squares estimates, (iii)
results of the Gauss-Newton iterations, (iv) t- and F-tests on constant elasticity,
and (v) the LM-test on constant elasticity.

(i)  Model

In Example 4.2 in Section 4.2.1 we considered the non-linear regression
model $\log(q_i) = f(d_i, \beta) + \varepsilon_i$ for coffee sales ($q$) in terms of the deal rate ($d$)
where

$$f(d, \beta) = \beta_1 + \frac{\beta_2}{\beta_3}\left(d^{\beta_3} - 1\right).$$

Of special interest is the hypothesis that $\beta_3 = 0$, which corresponds to a
constant demand elasticity. This case is obtained in the limit for $\beta_3 \to 0$,
which gives the linear model $f(d, \beta) = \beta_1 + \beta_2\log(d)$.


XM402COF 


(ii)  Non-linear  least  squares  estimates 

We first consider the $n = 12$ data for the first brand of coffee. For a given value
of $\beta_3$ the model is linear in the parameters $\beta_1$ and $\beta_2$ and these two parameters
can be estimated by regressing $\log(q_i)$ on a constant and $\frac{1}{\beta_3}(d_i^{\beta_3} - 1)$. Exhibit
4.9 (a) shows the minimal value of the least squares criterion (the sum of
squared residuals SSR in (4.11)) for a grid of values of $\beta_3$. The NLS estimates
correspond to the values where SSR is minimal. This grid search gives
$\hat\beta_3 = -13.43$, with corresponding estimates $\hat\beta_1 = 5.81$ and $\hat\beta_2 = 10.30$. The
SSR at $\hat\beta_3$ is of course lower than at $\beta_3 = 0$, and below we will test the
hypothesis that $\beta_3 = 0$ by evaluating whether this difference is significant.
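
The grid search can be sketched as follows. The coffee data are not reproduced here, so the example uses hypothetical, noise-free observations generated from the model with $\beta = (5.8, 10, -13)$ at three deal rates; the grid and all names are assumptions for illustration. For each value of $\beta_3$ on the grid, $\beta_1$ and $\beta_2$ follow from a simple regression.

```python
# Hypothetical data: three deal rates, generated without noise from
# f(d, beta) = beta1 + (beta2/beta3)*(d**beta3 - 1) with beta = (5.8, 10, -13).
d = [1.0, 1.05, 1.15] * 4
log_q = [5.8 + (10.0 / -13.0) * (di ** -13.0 - 1.0) for di in d]


def concentrated_fit(beta3):
    """For fixed beta3, regress log_q on a constant and z = (d**beta3 - 1)/beta3."""
    z = [(di ** beta3 - 1.0) / beta3 for di in d]
    n = len(d)
    z_bar, y_bar = sum(z) / n, sum(log_q) / n
    szz = sum((zi - z_bar) ** 2 for zi in z)
    szy = sum((zi - z_bar) * (yi - y_bar) for zi, yi in zip(z, log_q))
    beta2 = szy / szz
    beta1 = y_bar - beta2 * z_bar
    ssr = sum((yi - beta1 - beta2 * zi) ** 2 for zi, yi in zip(z, log_q))
    return ssr, beta1, beta2


grid = [k / 2 for k in range(-40, 0)]     # -20.0, -19.5, ..., -0.5 (zero excluded)
results = {b3: concentrated_fit(b3) for b3 in grid}
best_beta3 = min(grid, key=lambda b3: results[b3][0])
ssr_min, beta1_hat, beta2_hat = results[best_beta3]
print(best_beta3, beta1_hat, beta2_hat, ssr_min)
```

Because the hypothetical data contain no noise, the minimal SSR is (numerically) zero at the grid point $\beta_3 = -13$; with real data the minimum is positive and a finer grid around the best point refines the estimate.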

(iii)  Gauss-Newton  iterations 

Next we apply the Gauss-Newton algorithm for the estimation of $\beta$. As
starting values we take $\beta_1 = 0$, $\beta_2 = 1$, and $\beta_3 = 1$. The vector of gradients is
given by

$$\frac{\partial f}{\partial\beta_1} = 1,$$

$$\frac{\partial f}{\partial\beta_2} = \frac{1}{\beta_3}\left(d^{\beta_3} - 1\right),$$

$$\frac{\partial f}{\partial\beta_3} = -\frac{\beta_2}{\beta_3^2}\left(d^{\beta_3} - 1\right) + \frac{\beta_2}{\beta_3}d^{\beta_3}\log(d).$$

Exhibit 4.9 shows the estimates of $\beta_3$ (in (b)) and the value of SSR (in (c)) for
a number of iterations of the Gauss-Newton method. This shows that the
values of SSR converge, and the same holds true for the parameter estimates.
The resulting estimates of a software package are in Panel 2 in Exhibit 4.10.
The outcomes are in line with the earlier results based on a grid search for $\beta_3$.

Exhibit 4.9  Coffee Sales (Section 4.2.5)

[Panels (a) and (b) are graphs; panel (c) is the table below.]

(c)
iter    SSR
0       434.313433
1       0.443987
2       0.105480
3       0.087176
4       0.087049
5       0.087049
6       0.087049
7       0.087049
8       0.087049
9       0.087049
10      0.087049

Non-linear least squares for the model for coffee sales of brand 1. (a) shows the minimum
SSR that can be obtained for a given value of $\beta_3$, and the NLS estimate corresponds to
the value of $\beta_3$ where this SSR is minimal. (b) shows the values of $\beta_3$ that are obtained in
iterations of the Gauss-Newton algorithm, with starting values $\beta_1 = 0$ and $\beta_2 = \beta_3 = 1$.
(c) shows the values of SSR that are obtained in the Gauss-Newton iterations.


Exhibit 4.10  Coffee Sales (Section 4.2.5)

Panel 1: Dependent Variable: LOGQ1 (brand 1, 12 observations)
Method: Least Squares

Variable     Coefficient   Std. Error   t-Statistic   Prob.
C            5.841739      0.043072     135.6284      0.0000
LOGD1        4.664693      0.581918     8.016063      0.0000

R-squared    0.865333      Sum squared resid   0.132328

Panel 2: Dependent Variable: LOGQ1 (brand 1, 12 observations)
Method: Least Squares, convergence achieved after 5 iterations
LOGQ1 = C(1) + (C(2)/C(3)) * (D1^C(3) - 1)

Parameter    Coefficient   Std. Error   t-Statistic   Prob.
C(1)         5.807118      0.040150     144.6360      0.0000
C(2)         10.29832      3.295386     3.125072      0.0122
C(3)         -13.43073     6.674812     -2.012152     0.0751

R-squared    0.911413      Sum squared resid   0.087049

Panel 3: Dependent Variable: LOGQ2 (brand 2, 12 observations)
Method: Least Squares

Variable     Coefficient   Std. Error   t-Statistic   Prob.
C            4.406561      0.043048     102.3638      0.0000
LOGD2        6.003298      0.581599     10.32206      0.0000

R-squared    0.914196      Sum squared resid   0.132183

Panel 4: Dependent Variable: LOGQ2 (brand 2, 12 observations)
Method: Least Squares, convergence achieved after 5 iterations
LOGQ2 = C(1) + (C(2)/C(3)) * (D2^C(3) - 1)

Parameter    Coefficient   Std. Error   t-Statistic   Prob.
C(1)         4.377804      0.043236     101.2540      0.0000
C(2)         10.28864      3.001698     3.427608      0.0075
C(3)         -8.595289     5.207206     -1.650653     0.1332

R-squared    0.934474      Sum squared resid   0.100944

Panel 5: Dep Var: RESLIN1 (12 residuals of Panel 1 for brand 1)

Variable     Coefficient   Std. Error   t-Statistic   Prob.
C            -0.034622     0.040150     -0.862313     0.4109
LOGD1        4.449575      2.115810     2.103012      0.0648
LOGD1^2      -31.96557     14.77373     -2.163676     0.0587

R-squared    0.342177

Panel 6: Dep Var: RESLIN2 (12 residuals of Panel 3 for brand 2)

Variable     Coefficient   Std. Error   t-Statistic   Prob.
C            -0.028757     0.043236     -0.665120     0.5227
LOGD2        3.695844      2.278436     1.622097      0.1392
LOGD2^2      -26.55080     15.90927     -1.668888     0.1295

R-squared    0.236330

Regressions for two brands of coffee, models with constant elasticity (Panels 1 and 3), models
with varying elasticity (Panels 2 and 4), and auxiliary regressions for LM-tests on constant
elasticities (Panels 5 and 6).

Exhibit 4.10 also contains the NLS estimates for the second brand of coffee (in
Panel 4) and the estimates under the null hypothesis that $\beta_3 = 0$ (in Panels
1 and 3).


(iv)  t-  and  F-tests  on  constant  elasticity 

At 5 per cent significance, the t-test fails to reject the null hypothesis that
$\beta_3 = 0$ for both brands. The reported P-values (rounded to two decimals) are
0.08 for brand 1 (see Panel 2) and 0.13 for brand 2 (see Panel 4). Note,
however, that these values are based on the asymptotic distribution in (4.14)
and that the number of observations ($n = 12$) is quite small, so that the
P-values are not completely reliable. The F-tests for brands 1 and 2 are given by

$$F_1 = \frac{(0.1323 - 0.0870)/1}{0.0870/(12 - 3)} \approx 4.68, \qquad F_2 = \frac{(0.1322 - 0.1009)/1}{0.1009/(12 - 3)} \approx 2.79.$$

The 5 per cent critical value of the $F(1, 9)$ distribution is equal to 5.12, so
that the hypothesis that $\beta_3 = 0$ is again not rejected. As can be checked from
the t-values in Panels 2 and 4 in Exhibit 4.10, the F-values are not equal to
the squares of the t-value of $\hat\beta_3$. The relation $F = t^2$ for a single parameter
restriction was shown in Chapter 3 to be valid for linear models, but for
non-linear models this no longer holds true.
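
The F-statistics can be recomputed from the sums of squared residuals reported in Exhibit 4.10. The sketch below uses the unrounded values from the panels; the function name `f_statistic` is an assumption for illustration. It also squares the t-values of C(3) to show that they differ from the F-values.

```python
def f_statistic(ssr_restricted, ssr_unrestricted, n, k, g):
    """F = ((e_R'e_R - e'e)/g) / (e'e/(n - k))."""
    return ((ssr_restricted - ssr_unrestricted) / g) / (ssr_unrestricted / (n - k))


# Sums of squared residuals from Exhibit 4.10 (Panels 1 and 2, Panels 3 and 4).
F1 = f_statistic(0.132328, 0.087049, n=12, k=3, g=1)    # brand 1
F2 = f_statistic(0.132183, 0.100944, n=12, k=3, g=1)    # brand 2

# Squared t-values of C(3) from Panels 2 and 4, for comparison with F.
t1_sq = (-2.012152) ** 2
t2_sq = (-1.650653) ** 2

critical_value = 5.12    # 5 per cent critical value of F(1, 9), as quoted in the text
reject_1 = F1 > critical_value
reject_2 = F2 > critical_value
print(round(F1, 2), round(t1_sq, 2), reject_1)
print(round(F2, 2), round(t2_sq, 2), reject_2)
```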


(v)  LM-test on constant elasticity

Next we compute the LM-test for the hypothesis that $\beta_3 = 0$. To compute
this test, the residuals of the log-linear models (corresponding to $\beta_3 = 0$) are
regressed on the partial derivatives $\partial f/\partial\beta_j$ (evaluated at the estimated parameters
under the null hypothesis, so that limits for $\beta_3 \to 0$ should be taken).
This gives

$$\frac{\partial f}{\partial\beta_1} = 1,$$

$$\frac{\partial f}{\partial\beta_2} = \lim_{\beta_3 \to 0}\frac{1}{\beta_3}\left(d^{\beta_3} - 1\right) = \log(d),$$

$$\frac{\partial f}{\partial\beta_3} = \lim_{\beta_3 \to 0}\frac{\beta_2}{\beta_3}\left(d^{\beta_3}\log(d) - \frac{1}{\beta_3}\left(d^{\beta_3} - 1\right)\right) = \frac{1}{2}\beta_2(\log(d))^2,$$

so the relevant regressors in step 2 of the LM computation scheme are
1, $\log(d)$, and $(\log(d))^2$. The results of the auxiliary regressions in (4.22)
for the two brands are in Panels 5 and 6 in Exhibit 4.10. So the test statistics
(rounded to two decimals) are $LM_1 = 12R_1^2 = 12 \cdot 0.34 = 4.11$ and
$LM_2 = 12R_2^2 = 12 \cdot 0.24 = 2.84$. The 5 per cent critical value of the $\chi^2(1)$
distribution is equal to 3.84, so that in this case the null hypothesis is rejected
for brand 1, but not for brand 2.
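
The LM statistics follow directly from the $R^2$ values of the auxiliary regressions in Panels 5 and 6. The sketch below recomputes them and adds asymptotic P-values, which are not reported in the text; these use the fact that a $\chi^2(1)$ variable is the square of a standard normal variable, so that $P(\chi^2(1) > x) = \mathrm{erfc}(\sqrt{x/2})$. The function and variable names are assumptions for illustration.

```python
import math


def chi2_1_pvalue(x):
    """P(chi-square(1) > x), using chi2(1) = Z^2 for standard normal Z."""
    return math.erfc(math.sqrt(x / 2.0))


n = 12
r2 = {"brand 1": 0.342177, "brand 2": 0.236330}    # R-squared, Panels 5 and 6
lm = {brand: n * value for brand, value in r2.items()}
p_value = {brand: chi2_1_pvalue(stat) for brand, stat in lm.items()}
reject = {brand: stat > 3.84 for brand, stat in lm.items()}   # 5 per cent level

for brand in lm:
    print(brand, round(lm[brand], 2), round(p_value[brand], 3), reject[brand])
```

As with the t- and F-tests, these P-values rest on asymptotic results, and with only 12 observations they should be read with caution.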




4.3  Maximum  likelihood 


4.3.1  Motivation 

Two  approaches  in  estimation 

In Section 1.3.1 (p. 41) we discussed two approaches in parameter estimation.
One is based on the idea of minimizing the distance between the
data and the model parameters in some way. Least squares is an example of
this approach. Although this is a very useful method, it is not always the most
appropriate approach, and for some models it is even impossible to apply this
method, as will become clear in later chapters. Another approach in parameter
estimation is to maximize the likelihood of the parameters for the
observed data. Then the parameters are chosen in such a way that the
observed data become as likely or 'probable' as possible. In this section we
will discuss the method of maximum likelihood (ML) in more detail. We will
consider the general framework and we will use the linear model as an
illustration. The ML method is the appropriate estimation method for a
large variety of models, and applications for models of special interest in
business and economics will be discussed in later chapters.

Some  disadvantages  of  least  squares 

If we apply least squares in the linear model $y = X\beta + \varepsilon$, then
the estimator is given by

$$b = (X'X)^{-1}X'y = \beta + (X'X)^{-1}X'\varepsilon.$$

This means that the (unobserved) disturbances $\varepsilon$ affect the outcome
of $b$ in a linear way. If some of the disturbances $\varepsilon_i$ are large,
these observations have a relatively large impact on the estimates. There are
several ways to reduce the influence of such observations, for instance, by
adjusting the model, by transforming the data, or by using another criterion
than least squares. These methods are discussed in Chapter 5. Another approach
is to replace the normal distribution of the disturbances by another
distribution, for instance, one that has fatter tails. We recall from
Section 3.1.4 (p. 127) that OLS is the best linear unbiased estimator under
Assumptions 1-6.


However, if the disturbances are not normally distributed (so that
Assumption 7 is not satisfied), there exist non-linear estimators that are
more efficient. Asymptotically, the most efficient estimators are the maximum
likelihood estimators.
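The linear dependence of b on the disturbances can be checked numerically. The following is a minimal sketch with hypothetical data (the regressor values, the coefficients, and the size of the single large disturbance are all assumptions chosen for illustration); it shows that the shift in the least squares estimates equals $(X'X)^{-1}X'\varepsilon$.

```python
import numpy as np

# Hypothetical data: y = 1 + 2x exactly, so OLS recovers (1, 2) without error.
x = np.arange(10, dtype=float)
X = np.column_stack([np.ones_like(x), x])
beta = np.array([1.0, 2.0])
eps = np.zeros(10)

def ols(X, y):
    # b = (X'X)^{-1} X'y, computed by solving the normal equations
    return np.linalg.solve(X.T @ X, X.T @ y)

b_clean = ols(X, X @ beta + eps)

# One large disturbance in the last observation.
eps_out = eps.copy()
eps_out[-1] = 20.0
b_out = ols(X, X @ beta + eps_out)

# The shift in b equals (X'X)^{-1} X' eps: it is linear in the disturbances.
shift = np.linalg.solve(X.T @ X, X.T @ eps_out)
print(b_clean, b_out, shift)
```

In this sketch a single disturbance in one of ten observations moves the slope estimate by more than one unit, which illustrates why fat-tailed disturbances are a concern for least squares.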

Example  4.4:  Stock  Market  Returns  (continued) 

We  investigate  the  assumption  of  normally  distributed  disturbances  in 
the  CAPM  for  stock  market  returns  discussed  before  in  Example  2.1 
(p.  76-7).  We  will  discuss  (i)  the  possibility  of  fat  tails  in  returns  data, 
(ii)  the  least  squares  residuals,  and  (iii)  choice  of  the  distribution  of  the 
disturbances. 

(i)  Possibility  of  fat  tails  in  returns  data 

Traders  on  financial  markets  may  react  relatively  strongly  to  positive  or 
negative  news,  and  in  particular  they  may  react  to  the  behaviour  of  fellow 
traders.  This  kind  of  herd  behaviour  may  cause  excessive  up  and  down 
swings  of  stock  prices,  so  that  the  returns  may  be  larger  (both  positive  and 
negative)  than  would  normally  be  expected.  Such  periods  of  shared  panic  or 
euphoria  among  traders  may  lead  to  returns  far  away  from  the  long-run 
mean  —  that  is,  in  the  tail  of  the  distribution  of  returns. 

(ii)  Least  squares  residuals 

The data consist of excess returns data for the sector of cyclical consumer
goods (denoted by $y$) and for the whole market (denoted by $x$) in the UK. The
CAPM postulates the linear model

$$y_i = \alpha + \beta x_i + \varepsilon_i, \qquad i = 1, \ldots, n.$$

A scatter diagram of these data is given in Exhibit 2.1 (c) (p. 77). The
parameters $\alpha$ and $\beta$ can be estimated by least squares. The histogram
of the least squares residuals is shown in Exhibit 4.11 (a). It seems somewhat
doubtful that the disturbances are normally distributed. The sample mean and
standard deviation of the residuals $e_i$ are $\bar{e} = 0$ and
$s = \sqrt{\sum e_i^2/(n-1)} = 5.53$. Two of the $n = 240$ residuals have values
of around $-20 \approx -3.6s$. For the normal distribution, the probability of
outcomes more than 3.6 standard deviations away from the mean is around
0.0003, which is much smaller than 2/240 = 0.0083. The histogram indicates
that the disturbances may have fatter tails than the normal distribution.
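The tail probability quoted above can be reproduced with the standard library. This is a small sketch; the function name is our own, and it uses the identity $P(|Z| > c) = \mathrm{erfc}(c/\sqrt{2})$ for a standard normal variable $Z$.

```python
import math

def normal_two_sided_tail(c):
    # P(|Z| > c) for standard normal Z, via the complementary error function
    return math.erfc(c / math.sqrt(2.0))

p_normal = normal_two_sided_tail(3.6)   # probability under normality
share_observed = 2 / 240                # observed share of such residuals
ratio = share_observed / p_normal       # how much more frequent than expected
print(round(p_normal, 4), round(share_observed, 4))
```

Under these assumptions the observed share of extreme residuals is more than twenty times the share expected under normality.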

(iii)  Choice  of  distribution  of  the  disturbances 

As an alternative, one could for instance use a t-distribution for the
disturbances. Exhibit 4.11 (b) shows the density function of the standard
normal distribution and of the t(5) distribution (scaled so that it also has
variance equal to one). The table with values of the kurtosis in
Exhibit 4.11 (c) shows that t-distributions have fatter tails than the normal
distribution. In the next sections we describe the method of maximum
likelihood that can be applied for any specified distribution.

Exhibit 4.11 Stock Market Returns (Example 4.4)

[Figure: (a) histogram of the least squares residuals; (b) density functions
of the standard normal distribution and the scaled t(5) distribution, plotted
against x from -4 to 4.]

(c)
Degrees of freedom (d)   5     6     7     8     9     10    1000    ∞
Kurtosis of t(d)         9.0   6.0   5.0   4.5   4.2   4.0   3.006   3

(a) shows the histogram of the least squares residuals of the CAPM for the
sector of cyclical consumer goods in the UK. (b) shows two distributions, the
standard normal distribution and the t(5) distribution (scaled so that it has
variance 1); the t-distribution has fatter tails than the normal distribution.
(c) shows the kurtosis of t-distributions for selected values of the number of
degrees of freedom.
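The kurtosis values in panel (c) of the exhibit follow from the standard formula $3 + 6/(d - 4)$ for the kurtosis of the t(d) distribution, which is finite only for $d > 4$. The short sketch below (the function name is our own) reproduces the tabulated values.

```python
def t_kurtosis(d):
    # Kurtosis of the t(d) distribution: 3 + 6/(d - 4), finite only for d > 4
    if d <= 4:
        raise ValueError("kurtosis of t(d) is finite only for d > 4")
    return 3.0 + 6.0 / (d - 4.0)

table = {d: round(t_kurtosis(d), 3) for d in (5, 6, 7, 8, 9, 10, 1000)}
print(table)
```

As d grows the kurtosis approaches 3, the value for the normal distribution.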


4.3.2  Maximum  likelihood  estimation 

Uses Section 1.3.1; Appendix A.7.

The  idea  of  maximum  likelihood 

In Section 1.3.1 we discussed the method of maximum likelihood estimation
for data consisting of a random sample from a population with fixed mean
and variance. The idea is illustrated in Exhibit 4.12. The observed values of
the dependent variable are indicated by crosses. Clearly, this set of outcomes
is much more probable for the distribution on the right side than for
the distribution on the left side. This is expressed by saying that the
distribution on the right side has a larger likelihood than the one on the
left side. For a random sample $y_1, \ldots, y_n$ from the normal density
$N(\mu, \sigma^2)$, the normal distribution with the largest likelihood is
given by $\hat\mu = \bar{y}$ and $\hat\sigma^2 = \frac{1}{n}\sum (y_i - \bar{y})^2$,
see Section 1.3.1.

Exhibit 4.12 Maximum likelihood

The  set  of  actually  observed  outcomes  of  y  (denoted  by  the  crosses  on  the  horizontal  axis) 
is  less  probable  for  the  distribution  on  the  left  than  for  the  distribution  on  the  right; 
the  distribution  on  the  left  therefore  has  a  smaller  likelihood  than  the  distribution  on  the 
right. 
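The comparison of likelihoods can be made concrete with a few numbers. The sketch below uses hypothetical observations and two hypothetical candidate distributions (a 'left' one centred at 2 and a 'right' one centred at 5, both with variance 1); it also evaluates the log-likelihood at the sample mean and the sample variance with divisor n.

```python
import numpy as np

def normal_loglik(y, mu, sigma2):
    # Log-likelihood of an i.i.d. N(mu, sigma2) sample
    n = y.size
    return -0.5 * n * np.log(2 * np.pi * sigma2) - np.sum((y - mu) ** 2) / (2 * sigma2)

y = np.array([4.1, 5.3, 4.8, 5.9, 5.2, 4.7])  # hypothetical observations

mu_ml = y.mean()
sigma2_ml = np.mean((y - mu_ml) ** 2)   # divisor n, not n - 1

ll_left = normal_loglik(y, 2.0, 1.0)    # distribution centred far from the data
ll_right = normal_loglik(y, 5.0, 1.0)   # distribution centred near the data
ll_ml = normal_loglik(y, mu_ml, sigma2_ml)
print(ll_left, ll_right, ll_ml)
```

The distribution centred near the data has the larger log-likelihood, and no other normal distribution has a larger log-likelihood than the one evaluated at the ML estimates.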


Likelihood  function  and  log-likelihood 

We now extend the maximum likelihood (ML) method to more general
models. The observed data on the dependent variable are denoted by the
$n \times 1$ vector $y$ and those on the explanatory variables by the
$n \times k$ matrix $X$. In order to apply ML, the model is expressed in terms
of a joint probability density $p(y, X, \theta)$. Here $\theta$ denotes the
vector of model parameters, and for given values of $\theta$,
$p(y, X, \theta)$ is a probability density for $(y, X)$. On the other
hand, for given $(y, X)$ the likelihood function is defined by

$$L(\theta) = p(y, X, \theta). \tag{4.25}$$

Stated intuitively, this measures the 'probability' of observing the data
$(y, X)$ for different values of $\theta$. It is natural to prefer parameter
values for which this 'probability' is large. The maximum likelihood estimator
$\hat\theta_{ML}$ is defined as the value of $\theta$ that maximizes the
function $L(\theta)$ over the set of allowed parameter values. In practice,
for computational convenience one often maximizes the logarithmic likelihood
function or log-likelihood

$$l(\theta) = \log(L(\theta)). \tag{4.26}$$

As the logarithm is a monotonically increasing transformation, the maximum
of (4.25) and (4.26) is obtained for the same values of $\theta$. An attractive
property of ML is that it is invariant under reparametrization. That is,
suppose that the model is formulated in terms of another parameter vector
$\psi$ and that $\psi$ and $\theta$ are related by an invertible transformation
$\psi = h(\theta)$. Then the ML estimates are related by
$\hat\psi_{ML} = h(\hat\theta_{ML})$ (see also Section 1.3.1).

The log-likelihood can be decomposed if the observations $(y_i, x_i)$ are
mutually independent for $i = 1, \ldots, n$. If the probability density function
for the $i$th observation is $p_\theta(y_i, x_i)$, then the joint density is
$p(y, X, \theta) = \prod_{i=1}^n p_\theta(y_i, x_i)$, so that

$$l(\theta) = \sum_{i=1}^n \log(p_\theta(y_i, x_i)) = \sum_{i=1}^n l_i(\theta), \tag{4.27}$$

where $l_i(\theta) = \log(p_\theta(y_i, x_i))$ is the contribution of the $i$th
observation to the log-likelihood $l(\theta)$.


Numerical  aspects  of  optimization 

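In general the computation of $\hat\theta_{ML}$ is a non-linear optimization
problem. Solution methods were discussed in Section 4.2.3. For instance, the
Newton-Raphson iterations in (4.15) can be performed with the gradient vector
$G = \partial l(\theta)/\partial\theta$ and with the Hessian matrix
$H = \partial^2 l(\theta)/\partial\theta\partial\theta'$, or equivalently with
$G = \frac{1}{n}\partial l(\theta)/\partial\theta$ and
$H = \frac{1}{n}\partial^2 l(\theta)/\partial\theta\partial\theta'$. If the
observations are mutually independent, the result in (4.27) shows that then

$$G = \frac{1}{n}\sum_{i=1}^n \frac{\partial l_i}{\partial\theta}, \qquad
H = \frac{1}{n}\sum_{i=1}^n \frac{\partial^2 l_i}{\partial\theta\partial\theta'}.$$

In this case it is also possible to perform the iterations in a way where only
the first order derivatives (and no second order derivatives) need to be
computed. In this case there holds

$$\frac{1}{n}\sum_{i=1}^n \frac{\partial^2 l_i}{\partial\theta\partial\theta'}
\approx E\left[\frac{\partial^2 l_i}{\partial\theta\partial\theta'}\right]
= -E\left[\frac{\partial l_i}{\partial\theta}\frac{\partial l_i}{\partial\theta'}\right]
\approx -\frac{1}{n}\sum_{i=1}^n \frac{\partial l_i}{\partial\theta}\frac{\partial l_i}{\partial\theta'}. \tag{4.28}$$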


The first and the last approximate equalities follow from the law of large
numbers, as the terms $\partial^2 l_i/\partial\theta\partial\theta'$ are
mutually independent and the same holds true for the terms
$(\partial l_i/\partial\theta)(\partial l_i/\partial\theta')$. The middle
equality in (4.28) follows from (1.46) in Section 1.3.2 (p. 45) (applied for
each individual observation $i$ separately). The last term in (4.28) is called
the outer product of gradients. Using this approximation, the Newton-Raphson
iterations (4.15) become

$$\hat\theta_{h+1} = \hat\theta_h + \left(\sum_{i=1}^n \frac{\partial l_i}{\partial\theta}\frac{\partial l_i}{\partial\theta'}\right)^{-1} \sum_{i=1}^n \frac{\partial l_i}{\partial\theta},$$

where the derivatives are evaluated at the current estimate $\hat\theta_h$.
This is called the method of Berndt, Hall, Hall, and Hausman (abbreviated as
BHHH). As discussed in Section 4.2.3, one sometimes uses a regularization
factor and replaces the above matrix inverse by
$\left(\sum_{i=1}^n \frac{\partial l_i}{\partial\theta}\frac{\partial l_i}{\partial\theta'} + cI\right)^{-1}$,
with $c > 0$ a chosen constant and $I$ the identity matrix. This is called the
Marquardt algorithm. These methods have the advantage that they require only
the first order derivatives, but they may give less precise estimates as
compared with methods using the second order derivatives.
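To make the outer-product-of-gradients idea concrete, the following is a minimal sketch of BHHH-type iterations for a one-parameter model. The model (i.i.d. exponential data with rate `lam`), the data values, the starting value, and the stopping rule are all assumptions chosen for illustration; they are not taken from the text. Only the individual scores $\partial l_i/\partial\lambda = 1/\lambda - y_i$ are used.

```python
import numpy as np

# Hypothetical positive data, modelled as i.i.d. exponential with rate lam:
# l_i(lam) = log(lam) - lam * y_i, so the score is dl_i/dlam = 1/lam - y_i.
y = np.array([0.1, 0.4, 0.2, 1.3, 0.5, 0.9, 0.05, 0.55])

def bhhh_exponential(y, lam=1.0, tol=1e-10, max_iter=500):
    for _ in range(max_iter):
        scores = 1.0 / lam - y                     # first order derivatives only
        step = scores.sum() / np.sum(scores ** 2)  # (sum g_i^2)^{-1} * sum g_i
        lam = lam + step
        if abs(step) < tol:
            break
    return lam

lam_bhhh = bhhh_exponential(y)
lam_exact = 1.0 / y.mean()   # analytical ML estimate for this model
print(lam_bhhh, lam_exact)
```

For this model the ML estimate is known analytically, which gives a check on the iterations. With such a small sample the outer product is a rough approximation of the Hessian, so the iterations converge more slowly than Newton-Raphson would.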
ML in the linear model

In some cases the ML estimates can be computed analytically. An example is
given by the linear model under Assumptions 1-7 of Section 3.1.4 (p. 125-6).
In this case

$$y = X\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I), \tag{4.29}$$

so that $y \sim N(X\beta, \sigma^2 I)$. This model has parameters
$\theta = (\beta', \sigma^2)'$. Using the expression (1.21) in Section 1.2.3
(p. 31) for the density of the multivariate normal distribution with mean
$\mu = X\beta$ and covariance matrix $\Sigma = \sigma^2 I$, it follows that the
log-likelihood (4.26) is given by

$$l(\beta, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta). \tag{4.30}$$

The maximum likelihood estimates are obtained from the first order conditions

$$\frac{\partial l}{\partial\beta} = \frac{1}{\sigma^2}X'(y - X\beta) = 0, \tag{4.31}$$

$$\frac{\partial l}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}(y - X\beta)'(y - X\beta) = 0. \tag{4.32}$$

The solutions are given by

$$b_{ML} = (X'X)^{-1}X'y = b, \tag{4.33}$$

$$s^2_{ML} = \frac{1}{n}(y - Xb)'(y - Xb) = \frac{n-k}{n}s^2, \tag{4.34}$$

where $s^2$ is the (unbiased) least squares estimator of $\sigma^2$ discussed in
Section 3.1.5 (p. 128). This shows that $b_{ML}$ coincides with the least
squares estimator $b$, and that $s^2_{ML}$ differs from the unbiased estimator
$s^2$ by a factor that tends to 1 for $n \to \infty$.
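The analytical solutions can be checked numerically. The sketch below uses simulated data (the sample size, coefficients, and error scale are arbitrary assumptions) and verifies that the first order conditions (4.31) and (4.32) hold at the estimates (4.33) and (4.34).

```python
import numpy as np

rng = np.random.default_rng(0)          # simulated data, for illustration only
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.5, -2.0]) + rng.normal(scale=1.5, size=n)

b_ml = np.linalg.solve(X.T @ X, X.T @ y)    # (4.33): same as least squares
e = y - X @ b_ml
s2_ml = e @ e / n                            # (4.34): divisor n
s2 = e @ e / (n - k)                         # unbiased least squares estimator

# First order conditions (4.31) and (4.32) evaluated at the ML estimates
grad_beta = X.T @ e / s2_ml
grad_sigma2 = -n / (2 * s2_ml) + (e @ e) / (2 * s2_ml ** 2)
print(b_ml, s2_ml, s2)
```

Both gradients vanish up to rounding error, and the two variance estimates differ by the factor $(n - k)/n$.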

ML  in  non-linear  regression  models 

In a similar way, the ML estimates of $\beta$ in the non-linear regression model
$y_i = f(x_i, \beta) + \varepsilon_i$ with $\varepsilon_i \sim NID(0, \sigma^2)$
are equal to the non-linear least squares estimates $b_{NLS}$ (see Exercise 4.6).


Exercises:  T:  4.6c;  S:  4.12a,  b,  d-f. 
4.3.3 Asymptotic properties

Uses  Sections  1.3.2,  1.3.3;  Appendix  A.7. 


Asymptotic  distribution  of  ML  estimators 

As was discussed in Section 1.3.3 (p. 51-2), maximum likelihood estimators
have asymptotically optimal statistical properties. Apart from mild regularity
conditions on the log-likelihood (4.26), the main condition is that
the model (that is, the joint probability distribution of the data) has been
specified correctly. Then the maximum likelihood estimator is consistent, is
asymptotically efficient, and has an asymptotically normal distribution.
The model is correctly specified if there exists a parameter $\theta_0$ so that
the data are generated by the probability distribution $p(y, X, \theta_0)$. The
asymptotic efficiency means that $\sqrt{n}(\hat\theta_{ML} - \theta_0)$ has the
smallest covariance matrix among all consistent estimators of $\theta_0$ (the
reason for scaling with $\sqrt{n}$ is to get a finite, non-zero covariance
matrix in the limit). Some regularity conditions are necessary for
generalizations of the central limit theorem to hold true, so that

$$\sqrt{n}(\hat\theta_{ML} - \theta_0) \xrightarrow{d} N(0, \mathcal{I}_0^{-1}). \tag{4.35}$$

Here $\mathcal{I}_0$ is the asymptotic information matrix evaluated at
$\theta_0$, that is,
$\mathcal{I}_0 = \lim_{n \to \infty}\left(\frac{1}{n}\mathcal{I}_n(\theta_0)\right)$, where

$$\mathcal{I}_n(\theta_0) = E\left[\frac{\partial l}{\partial\theta}\frac{\partial l}{\partial\theta'}\right] = -E\left[\frac{\partial^2 l}{\partial\theta\partial\theta'}\right] \tag{4.36}$$

is the information matrix (evaluated at $\theta = \theta_0$) for sample size $n$
of the data $(y, X)$ in (4.25).

Approximate  distribution  for  finite  samples 

This means that, asymptotically, conventional t- and F-tests can be based on
the approximate distribution

$$\hat\theta_{ML} \approx N\left(\theta_0, \mathcal{I}_n^{-1}(\hat\theta_{ML})\right),$$

where we used that in large enough samples
$\mathrm{var}(\hat\theta_{ML}) \approx \frac{1}{n}\mathcal{I}_0^{-1} \approx \mathcal{I}_n^{-1}(\theta_0) \approx \mathcal{I}_n^{-1}(\hat\theta_{ML})$.
In the following sections we discuss some alternative tests that are of much
practical use, namely, the Likelihood Ratio test, the Wald test, and the
Lagrange Multiplier test. These tests are compared in Section 4.3.8.
Illustration of asymptotic results for ML in the linear model

Now we illustrate the above asymptotic results for the linear model
$y = X\beta + \varepsilon$ that satisfies Assumptions 1-7. Here the parameter
vector is given by the $(k+1) \times 1$ vector $\theta = (\beta', \sigma^2)'$
and the vector of ML estimators by $\hat\theta_{ML} = (b_{ML}', s^2_{ML})'$ in
(4.33) and (4.34). According to (4.35), these estimators (when measured in
deviation from $\theta_0$ and multiplied by $\sqrt{n}$) are asymptotically
normally distributed with covariance matrix $\mathcal{I}_0^{-1}$. The second
order derivatives of the log-likelihood in (4.30) are given by

$$\frac{\partial^2 l}{\partial\beta\partial\beta'} = -\frac{1}{\sigma^2}X'X, \tag{4.37}$$

$$\frac{\partial^2 l}{\partial\beta\partial\sigma^2} = -\frac{1}{\sigma^4}X'(y - X\beta), \tag{4.38}$$

$$\frac{\partial^2 l}{\partial\sigma^2\partial\sigma^2} = \frac{n}{2\sigma^4} - \frac{1}{\sigma^6}(y - X\beta)'(y - X\beta). \tag{4.39}$$

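Using (4.29) and the assumption that $X$ is fixed, for $\beta = \beta_0$ there
holds $E[X'(y - X\beta_0)] = X'E[y - X\beta_0] = 0$ and
$E[(y - X\beta_0)'(y - X\beta_0)] = n\sigma^2$, so that the
$(k+1) \times (k+1)$ information matrix in (4.36) is given by

$$\mathcal{I}_n(\theta_0) = \begin{pmatrix} \frac{1}{\sigma^2}X'X & 0 \\ 0 & \frac{n}{2\sigma^4} \end{pmatrix}. \tag{4.40}$$

The asymptotic covariance matrix is obtained from
$\mathcal{I}_0 = \lim\left(\frac{1}{n}\mathcal{I}_n(\theta_0)\right)$, and under
Assumption 1* in Section 4.1.2 it follows that

$$\mathcal{I}_0 = \begin{pmatrix} \frac{1}{\sigma^2}Q & 0 \\ 0 & \frac{1}{2\sigma^4} \end{pmatrix}.$$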

Therefore, large sample approximations of the distribution of the ML estimators
(4.33) and (4.34) for the linear model are given by

$$b_{ML} \approx N\left(\beta, \sigma^2(X'X)^{-1}\right), \tag{4.41}$$

$$s^2_{ML} \approx N\left(\sigma^2, \frac{2\sigma^4}{n}\right). \tag{4.42}$$

Actually, for the model (4.29) the distribution in (4.41) holds exactly, as was
shown in Section 3.3.1 (p. 152). In Section 3.4.1 we considered the F-test for
the null hypothesis of $g$ linear restrictions on the model (4.29) of the form
$R\beta = r$, where $R$ is a $g \times k$ matrix of rank $g$. In (3.50) this
test is computed in the form

$$F = \frac{(e_R'e_R - e'e)/g}{e'e/(n - k)}, \tag{4.43}$$

where $e_R'e_R$ and $e'e$ are the sums of squared residuals under the null and
alternative hypothesis respectively. Under Assumptions 1-7, the test statistic
(4.43) has the $F(g, n - k)$ distribution.
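The step from the information matrix (4.40) to the approximations (4.41) and (4.42) is a matrix inversion, which can be verified for a concrete case. In the sketch below the regressor values and the value of $\sigma^2$ are hypothetical.

```python
import numpy as np

# Hypothetical fixed regressors and a hypothetical value of sigma^2
x = np.arange(1.0, 9.0)
X = np.column_stack([np.ones_like(x), x])
n, k = X.shape
sigma2 = 2.0

# Information matrix (4.40): block diagonal in (beta, sigma^2)
info = np.zeros((k + 1, k + 1))
info[:k, :k] = X.T @ X / sigma2
info[k, k] = n / (2 * sigma2 ** 2)

cov = np.linalg.inv(info)                    # approximate covariance of the ML estimators
cov_b = sigma2 * np.linalg.inv(X.T @ X)      # covariance in (4.41)
var_s2 = 2 * sigma2 ** 2 / n                 # variance in (4.42)
print(cov)
```

Because the information matrix is block diagonal, its inverse is obtained block by block, so that the estimators of $\beta$ and $\sigma^2$ are asymptotically uncorrelated.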

Summary  of  computations  in  ML 

To  estimate  model  parameters  by  the  method  of  maximum  likelihood,  one 
proceeds  as  follows. 


Computations  in  ML 

•  Step 1: Formulate the log-likelihood. First one has to specify the form of
the likelihood function, that is, the form of the joint probability function
$L(\theta) = p(y, X, \theta)$. For given data $y$ and $X$, this should be a
known function of $\theta$, that is, for every choice of $\theta$ the value of
$L(\theta)$ can be computed. The criterion for estimation is the maximization
of $L(\theta)$, or, equivalently, the maximization of the log-likelihood
$l(\theta) = \log(L(\theta))$.

•  Step 2: Maximize the log-likelihood. For the observed data $y$ and $X$, the
log-likelihood $l(\theta) = \log(p(y, X, \theta))$ is maximized with respect to
the parameters $\theta$. This is often a non-linear optimization problem, and
numerical aspects were discussed in Section 4.3.2.

•  Step 3: Asymptotic tests. Approximate t-values and F-tests for the ML
estimates $\hat\theta_{ML}$ can be obtained from the fact that this estimator
is consistent and approximately normally distributed with covariance matrix
$\mathrm{var}(\hat\theta_{ML}) \approx \mathcal{I}_n^{-1}(\hat\theta_{ML})$,
where $\mathcal{I}_n$ is the information matrix defined in (4.36) and evaluated
at $\hat\theta_{ML}$. In Section 4.3.8 we will make some comments on the actual
computation of this covariance matrix.
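The three steps above can be sketched for a simple one-parameter model. The model used here (i.i.d. Poisson counts with mean $\exp(\theta)$), the data, and the starting value are assumptions chosen for illustration only; they are not an example from the text.

```python
import math

y = [2, 0, 3, 1, 4, 2, 1, 3]     # hypothetical count data
n, total = len(y), sum(y)

# Step 1: log-likelihood of an i.i.d. Poisson sample with mean exp(theta)
def loglik(theta):
    return total * theta - n * math.exp(theta) - sum(math.lgamma(v + 1) for v in y)

# Step 2: Newton-Raphson with gradient G and Hessian H
theta = 0.0
for _ in range(50):
    G = total - n * math.exp(theta)
    H = -n * math.exp(theta)
    step = -G / H
    theta += step
    if abs(step) < 1e-12:
        break

# Step 3: asymptotic standard error from the information, here equal to -H
info = n * math.exp(theta)
se = 1.0 / math.sqrt(info)
t_value = theta / se             # approximate t-value for H0: theta = 0
print(theta, se, t_value)
```

In this model the Hessian does not depend on the data, so the information equals minus the Hessian; in general the expectation in (4.36) has to be evaluated or approximated.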


Exercises:  E:  4.17a-f. 


4.3.4  The  Likelihood  Ratio  test 

Uses Appendix A.8.


General form of the LR-test

Suppose that the model is given by the likelihood function (4.25) and that the
null hypothesis imposes $g$ independent restrictions $r(\theta) = 0$ on the
parameters. We denote the ML estimator under the null hypothesis by
$\hat\theta_0$ and the ML estimator under the alternative by $\hat\theta_1$.
The Likelihood Ratio test is based on the loss of log-likelihood that results
if the restrictions are imposed, that is,

$$LR = 2\log(L(\hat\theta_1)) - 2\log(L(\hat\theta_0)) = 2l(\hat\theta_1) - 2l(\hat\theta_0). \tag{4.44}$$

A graphical illustration of this test is given in Exhibit 4.13. Here $\theta$
is a single parameter and the null hypothesis is that $\theta = 0$. This
hypothesis is rejected if the (vertical) distance between the log-likelihoods
is too large. It can be shown that, if the null hypothesis is true,

$$LR \approx \chi^2(g). \tag{4.45}$$

The null hypothesis is rejected if LR is sufficiently large. For a proof of
(4.45) we refer to textbooks on statistics (see Chapter 1, Further Reading
(p. 68)).

Exhibit 4.13 Likelihood Ratio test

Graphical illustration of the Likelihood Ratio test. The restrictions are
rejected if the loss in the log-likelihood (measured on the vertical axis) is
too large.


LR-test  in  the  linear  model 

As an illustration we consider the linear model $y = X\beta + \varepsilon$ with
Assumptions 1-7. To compute the LR-test for the null hypothesis that
$R\beta = r$ (with $R$ a $g \times k$ matrix of rank $g$), we use a technique
known as concentration of the log-likelihood. This means that the ML
optimization problem is transformed into another one that involves fewer
parameters. For a linear model it follows from (4.32) that, for given value of
$\beta$, the optimal value of $\sigma^2$ is given by
$\hat\sigma^2(\beta) = \frac{1}{n}(y - X\beta)'(y - X\beta)$. Substituting this
in (4.30), the optimal value of $\beta$ is obtained by maximizing the
concentrated log-likelihood

$$l(\beta, \hat\sigma^2(\beta)) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\hat\sigma^2(\beta)) - \frac{n}{2}.$$

This function of $\beta$ is maximal if $\hat\sigma^2(\beta)$ is minimal, and
this corresponds to least squares. The maximum likelihood estimator of $\beta$
under the null hypothesis is therefore given by the restricted least squares
estimator $b_R$, and the above expression for the log-likelihood shows that

$$LR = 2l(\hat\theta_1) - 2l(\hat\theta_0) = -n\log(\hat\sigma^2(b)) + n\log(\hat\sigma^2(b_R)) = n\log\left(\frac{e_R'e_R}{e'e}\right),$$

where $b$ is the unrestricted least squares estimator. The relation between
this test and the F-test in (4.43) is given by

$$LR = n\log\left(1 + \frac{e_R'e_R - e'e}{e'e}\right) = n\log\left(1 + \frac{g}{n-k}F\right). \tag{4.46}$$

This result holds true for linear models with linear restrictions under the
null hypothesis. It does not in general hold true for other types of models
and restrictions.
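The relation (4.46) can be verified numerically. The sketch below uses simulated data (sample size, coefficients, and the tested restriction are arbitrary assumptions) and computes LR both from the two sums of squared residuals and from the F-statistic (4.43).

```python
import numpy as np

rng = np.random.default_rng(1)           # simulated data, for illustration only
n, k, g = 40, 3, 1
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.8, 0.3]) + rng.normal(size=n)

def ssr(Xm, y):
    # Sum of squared least squares residuals
    b = np.linalg.lstsq(Xm, y, rcond=None)[0]
    e = y - Xm @ b
    return e @ e

ssr_u = ssr(X, y)            # unrestricted model
ssr_r = ssr(X[:, :2], y)     # restricted model: coefficient of last regressor is zero

F = ((ssr_r - ssr_u) / g) / (ssr_u / (n - k))
LR = n * np.log(ssr_r / ssr_u)
LR_from_F = n * np.log(1 + g * F / (n - k))   # relation (4.46)
print(F, LR, LR_from_F)
```

The two values of LR agree up to rounding error, because $1 + gF/(n-k) = e_R'e_R/e'e$.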

Computational disadvantage of the LR-test

The LR-test (4.44) requires that ML estimates are determined both for the
unrestricted model and for the restricted model. If the required computations
turn out to be complicated, then it may be more convenient to estimate only
one of these two models. Two such test methods are discussed in the
following two sections.

Exercises:  E:  4.13e,  4.14b,  4.15b,  4.16f. 


4.3.5  The  Wald  test 

Idea  of  Wald  test  (for  a  single  parameter) 

Whereas the LR-test requires two optimizations (ML under the null hypothesis
and ML under the alternative hypothesis), the Wald test is based on the
unrestricted model alone. This test considers how far the restrictions are
satisfied by the unrestricted estimator $\hat\theta_1$. This is illustrated
graphically in Exhibit 4.14 for the simple case of a single parameter $\theta$
with the restriction $\theta = 0$. The (horizontal) difference between the
unrestricted estimator $\hat\theta_1$ and $\theta = 0$ is related to the
(vertical) difference in the log-likelihoods. Because only the unrestricted
model is estimated, an indication of this vertical distance is obtained by the
curvature of the log-likelihood $l$ in $\hat\theta_1$. The exhibit shows that
this distance becomes larger for larger curvatures. Asymptotically, the
curvature is equal to the inverse of the covariance matrix of $\hat\theta_1$,
see (4.35) and (4.36). This motivates estimating the loss in log-likelihood
that results from imposing the restriction $\theta = 0$ by the Wald test
statistic

$$W = \frac{\hat\theta_1^2}{s^2_{\hat\theta_1}} \approx \chi^2(1).$$

Here $s^2_{\hat\theta_1}$ is an estimate of the variance of the unrestricted ML
estimator $\hat\theta_1$ and the asymptotic distribution follows from (4.35).
The expression $\hat\theta_1/s_{\hat\theta_1}$ is analogous to the t-value in a
regression model (see Section 3.3.1 (p. 153)). The t-test for a single
parameter restriction is also obtained by estimating the unrestricted model
and evaluating whether the estimated parameter differs significantly from zero.

Exhibit 4.14 Wald test

Graphical illustration of the Wald test. The restrictions are rejected if the
estimated parameters are too far away from the restrictions of the null
hypothesis. This is taken as an indication that the loss in the log-likelihood
is too large, and this 'vertical' difference is larger if the log-likelihood
function has a larger curvature.


Derivation  of  Wald  test  for  general  parameter  restrictions 

Now we describe the Wald test for the general case of $g$ non-linear
restrictions $r(\theta) = 0$. Suppose that this hypothesis holds true for the
DGP, so that $r(\theta_0) = 0$. Because $\hat\theta_1$ is consistent, it follows
that in large enough samples
$r(\hat\theta_1) \approx r(\theta_0) + R_0(\hat\theta_1 - \theta_0) = R_0(\hat\theta_1 - \theta_0)$,
where $R_0 = \partial r/\partial\theta'$ evaluated at $\theta = \theta_0$. It
follows from (4.35) that

$$\sqrt{n}\,r(\hat\theta_1) \xrightarrow{d} N(0, R_0\mathcal{I}_0^{-1}R_0'). \tag{4.47}$$

Let $R_1 = \partial r/\partial\theta'$ evaluated at $\theta = \hat\theta_1$ and
let $\mathcal{I}_n(\hat\theta_1)$ be the information matrix for sample size $n$
defined in (4.36) evaluated at $\theta = \hat\theta_1$. Then
$\mathrm{plim}(R_1) = R_0$ and
$\mathrm{plim}\left(\frac{1}{n}\mathcal{I}_n(\hat\theta_1)\right) = \mathcal{I}_0$,
and (4.47) implies that

$$r(\hat\theta_1) \approx N\left(0, R_1\mathcal{I}_n^{-1}(\hat\theta_1)R_1'\right).$$

Now recall that, if the $g \times 1$ vector $z$ has the distribution $N(0, V)$
then $z'V^{-1}z \sim \chi^2(g)$, so that under the null hypothesis

$$W = r(\hat\theta_1)'\left(R_1\mathcal{I}_n^{-1}(\hat\theta_1)R_1'\right)^{-1}r(\hat\theta_1) \approx \chi^2(g). \tag{4.48}$$

This is an attractive test if the restricted model is difficult to estimate,
for instance, if the parameter restriction $r(\theta) = 0$ is non-linear. A
disadvantage is that the numerical outcome of the test may depend on the way
the model and the restrictions are formulated (see Exercise 4.16 for an
example).
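The dependence of the Wald statistic on the formulation of the restriction can be seen in a single-parameter sketch. The estimate and its variance below are hypothetical numbers, and the two formulations of the same hypothesis ($\theta = 1$ written as $\theta - 1 = 0$ or as $\log\theta = 0$) are our own choice of illustration.

```python
import math

theta_hat = 1.5     # hypothetical unrestricted ML estimate
var_hat = 0.08      # hypothetical estimate of its variance

def wald(r, R, v):
    # Single-restriction case of (4.48): W = r^2 / (R v R)
    return r ** 2 / (R * v * R)

# Formulation A: r(theta) = theta - 1, with derivative 1
W_a = wald(theta_hat - 1.0, 1.0, var_hat)
# Formulation B: r(theta) = log(theta), with derivative 1/theta
W_b = wald(math.log(theta_hat), 1.0 / theta_hat, var_hat)
print(W_a, W_b)
```

With these numbers the two statistics fall on opposite sides of the 5 per cent critical value 3.84 of the $\chi^2(1)$ distribution, although both formulations express the same hypothesis.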


Wald  test  in  the  linear  model 

We illustrate the Wald test by considering the linear model
$y = X\beta + \varepsilon$ with Assumptions 1-7 and the linear hypothesis that
$R\beta = r$ (with $R$ a $g \times k$ matrix of rank $g$). The parameter vector
$\theta = (\beta', \sigma^2)'$ contains $k + 1$ parameters and the restrictions
are given by $r(\theta) = 0$, where
$r(\theta) = R\beta - r = (R \;\; 0)\theta - r$. The unrestricted estimators
are given by $b$ in (4.33) and $s^2_{ML} = e'e/n$ in (4.34), where
$e = y - Xb$ are the unrestricted least squares residuals. So in (4.48) we have
$r(\hat\theta_1) = Rb - r$ and
$R_1 = \partial r/\partial\theta' = (\partial r/\partial\beta' \;\; \partial r/\partial\sigma^2) = (R \;\; 0)$.
An asymptotic approximation of the inverse of the information matrix in (4.48)
is obtained from (4.40), that is,

$$\mathcal{I}_n^{-1}(\hat\theta_1) \approx \begin{pmatrix} \sigma^2(X'X)^{-1} & 0 \\ 0 & \frac{2\sigma^4}{n} \end{pmatrix} \approx \begin{pmatrix} s^2_{ML}(X'X)^{-1} & 0 \\ 0 & \frac{2s^4_{ML}}{n} \end{pmatrix}.$$

Combining these results, we get
$R_1\mathcal{I}_n^{-1}(\hat\theta_1)R_1' \approx s^2_{ML}R(X'X)^{-1}R'$, so that

$$W = (Rb - r)'\left(s^2_{ML}R(X'X)^{-1}R'\right)^{-1}(Rb - r)
= \frac{(Rb - r)'\left(R(X'X)^{-1}R'\right)^{-1}(Rb - r)}{e'e/n}
= \frac{ng}{n-k}F. \tag{4.49}$$

The last equality follows from (3.54) in Section 3.4.1 (p. 165). This formula,
like the one in (4.46), holds true only for linear models with linear
restrictions. (Some software packages, such as EViews, compute the Wald test
with the OLS estimate $s^2$ instead of the ML estimate $s^2_{ML}$, in which
case the relation (4.49) becomes $W = gF$; in EViews, tests of coefficient
restrictions are computed in two ways, one with the F-test and with P-values
based on the $F(g, n - k)$ distribution, and another with the Wald test
$W = gF$ and with P-values based on the $\chi^2(g)$ distribution.)


Relation between Wald test and t-test

For the case of a single restriction (so that $g = 1$) we obtained in
Section 3.4.1 the result that $F = t^2$. Substituting this in (4.49) we get the
following relation between the Wald test and the t-test for a single parameter
restriction:

$$W = \frac{n}{n-k}t^2. \tag{4.50}$$

The cause of the difference lies in the different estimator of the variance
$\sigma^2$ of the error terms, $s^2_{ML}$ in the Wald test and the OLS
estimator $s^2$ in the t-test. Because of the relation
$s^2_{ML} = \frac{n-k}{n}s^2$ in (4.34), the relation (4.50) can also be
written as

$$W = t^2 \cdot \frac{s^2}{s^2_{ML}}.$$
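The relations between the Wald statistic, the F-statistic, and the t-value can be checked on simulated data. In the sketch below the data-generating values and the tested restriction (the last coefficient equals zero) are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)           # simulated data, for illustration only
n, k = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([0.5, 1.0, 0.2]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b
s2 = e @ e / (n - k)       # OLS variance estimate
s2_ml = e @ e / n          # ML variance estimate

# Single restriction: the last coefficient is zero, so R = (0 0 1), r = 0, g = 1
R = np.array([[0.0, 0.0, 1.0]])
diff = R @ b
middle = np.linalg.inv(R @ XtX_inv @ R.T)

W = float(diff @ middle @ diff) / s2_ml   # Wald statistic as in (4.49)
F = float(diff @ middle @ diff) / s2      # F-statistic for g = 1
t = b[2] / np.sqrt(s2 * XtX_inv[2, 2])    # usual t-value
print(W, F, t ** 2)
```

The Wald statistic exceeds $t^2$ by the factor $n/(n-k)$, which is the ratio of the two variance estimates.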

Exercises:  S:  4.11a,  b,  d,  e;  E:  4.13c,  4.14c,  4.15b,  4.16h-j. 


4.3.6  The  Lagrange  Multiplier  test 

Uses Section 1.2.3; Appendix A.8.


Formulation  of  parameter  restrictions  by  means  of 
Lagrange  parameters 


As a third test we discuss the Lagrange Multiplier test, also called the score
test. This test considers whether the gradient (also called the 'score') of the
unrestricted likelihood function is sufficiently close to zero at the restricted
estimate $\hat\theta_0$. This test was discussed in Section 4.2.4 for regression
models where we minimize the sum of squares criterion (4.11), but now we
consider this within the framework of ML estimation where we maximize the
log-likelihood criterion (4.26).

The null hypothesis $r(\theta) = 0$ imposes $g$ independent restrictions on
$\theta$. For simplicity of notation we suppose that the vector of parameters
$\theta$ can be split in two parts, $\theta = (\theta_1', \theta_2')'$, and that
the restrictions are given by $\theta_2 = 0$, where $\theta_2$ contains $g$
components. Then the restricted ML estimator can be obtained by maximizing the
Lagrange function

$$\Lambda(\theta_1, \theta_2, \lambda) = l(\theta_1, \theta_2) - \lambda'\theta_2.$$

Here $\lambda$ is the $g \times 1$ vector of Lagrange multipliers. The
restricted maximum satisfies the first order conditions

$$\frac{\partial\Lambda}{\partial\theta_1} = \frac{\partial l}{\partial\theta_1} = 0, \qquad
\frac{\partial\Lambda}{\partial\theta_2} = \frac{\partial l}{\partial\theta_2} - \lambda = 0, \qquad
\frac{\partial\Lambda}{\partial\lambda} = -\theta_2 = 0. \tag{4.51}$$

So the Lagrange multipliers $\lambda = \partial l/\partial\theta_2$ measure the
marginal increase in the log-likelihood $l$ if the restrictions $\theta_2 = 0$
are relaxed. The idea is to reject the restrictions if these marginal effects
are too large.


Idea of LM-test for a single parameter

This is illustrated graphically in Exhibit 4.15 for the simple case of a single
parameter ($g = 1$, $\theta = \theta_2$ contains one component, and there are
no additional components $\theta_1$). The slope
$\lambda = \partial l/\partial\theta$ in $\theta = 0$ is related to the
(vertical) difference in the log-likelihoods $l(\hat\theta) - l(0)$, where
$\hat\theta$ is the unrestricted ML estimate. This difference is larger for
smaller curvatures $\partial^2 l/\partial\theta^2$ in $\theta = 0$. This
suggests evaluating the loss in log-likelihood, which results from imposing the
restriction that $\theta = 0$, by the LM-test statistic

$$LM = \frac{(\partial l/\partial\theta)^2}{-\partial^2 l/\partial\theta^2} \approx \chi^2(1),$$

with both derivatives evaluated at $\theta = 0$.


Exhibit  4.15  Lagrange  Multiplier  test 

Graphical  illustration  of  the  Lagrange  Multiplier  test.  The  restrictions  are  rejected  if  the 
gradient  (evaluated  at  the  restricted  parameter  estimates)  differs  too  much  from  zero.  This 
is  taken  as  an  indication  that  the  loss  in  the  log-likelihood  is  too  large,  and  this  ‘vertical’ 
difference  is  larger  if  the  log-likelihood  has  a  smaller  curvature. 
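The idea that only the restricted value is needed can be sketched for a one-parameter model. The model (i.i.d. exponential data with rate `lam`), the data, and the null value `lam0` are hypothetical; the statistic is computed as the squared score divided by the information, both at the restricted value, and the LR statistic is added only for comparison.

```python
import numpy as np

y = np.array([0.1, 0.4, 0.2, 1.3, 0.5, 0.9, 0.05, 0.55])  # hypothetical data
n = y.size
lam0 = 1.0     # value of the rate under the null hypothesis

# Exponential model: l(lam) = n*log(lam) - lam*sum(y)
score = n / lam0 - y.sum()          # gradient at the restricted value
info = n / lam0 ** 2                # minus the second derivative at lam0
LM = score ** 2 / info

# For comparison, LR needs the unrestricted estimate as well
lam_hat = 1.0 / y.mean()

def loglik(lam):
    return n * np.log(lam) - lam * y.sum()

LR = 2 * (loglik(lam_hat) - loglik(lam0))
print(LM, LR)
```

The LM statistic uses no information about the unrestricted estimate, whereas LR requires both estimates; in finite samples the two values generally differ.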
Derivation of LM-test for general parameter restrictions

Now we return to the more general case in (4.51) and consider a test for the
null hypothesis that $\theta_2 = 0$ that is based on the magnitude of the
vector of Lagrange multipliers $\lambda$. To test the significance of the
Lagrange multipliers we have to derive the distribution under the null
hypothesis of $\lambda = \partial l/\partial\theta_2$ (evaluated at the
restricted ML estimates). This derivation (which runs till (4.54) below) goes
as follows. Let

$$z(\theta) = \begin{pmatrix} \partial l/\partial\theta_1 \\ \partial l/\partial\theta_2 \end{pmatrix}$$

be the gradient vector of the log-likelihood (4.26) for $n$ observations. Then,
under weak regularity conditions, this vector evaluated at the parameter
$\theta = \theta_0$ of the DGP has the property that

$$\frac{1}{\sqrt{n}}z(\theta_0) \xrightarrow{d} N(0, \mathcal{I}_0). \tag{4.52}$$

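The proof of asymptotic normality is beyond the scope of this book and is based
on generalizations of the central limit theorem. Here we only consider the mean
and variance of $z(\theta_0)$. To compute the mean, we write $l(\theta) = \log(p_\theta(y))$ with
$p_\theta(y) = p(y, X, \theta)$ the density in (4.25). Then

$$E[z(\theta_0)] = E\left[\frac{\partial \log p_\theta(y)}{\partial\theta}\Big|_{\theta=\theta_0}\right]
= \int \frac{1}{p_{\theta_0}(y)}\, \frac{\partial p_\theta(y)}{\partial\theta}\Big|_{\theta=\theta_0}\, p_{\theta_0}(y)\, dy
= \int \frac{\partial p_\theta(y)}{\partial\theta}\Big|_{\theta=\theta_0}\, dy
= \frac{\partial}{\partial\theta}\left(\int p_\theta(y)\, dy\right)\Big|_{\theta=\theta_0} = 0, \qquad (4.53)$$

as $\int p_\theta(y)\, dy = 1$ for every density function of $y$. Using (4.36) it then
follows that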


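$$\mathrm{var}(z(\theta_0)) = E\left[\frac{\partial l}{\partial\theta}\, \frac{\partial l}{\partial\theta'}\right]\Big|_{\theta=\theta_0} = \mathcal{I}_n(\theta_0).$$

The two foregoing results show that $\frac{1}{\sqrt{n}} z(\theta_0)$ in (4.52) has mean zero and
covariance matrix $\frac{1}{n}\mathcal{I}_n(\theta_0)$. For $n \to \infty$ this covariance matrix converges to $\mathcal{I}_0$.

If the null hypothesis $\theta_2 = 0$ is true and $\hat\theta_0$ denotes the ML estimator of $\theta$ under
this hypothesis, then $\hat\theta_0$ is a consistent estimator of $\theta_0$ and (4.52) implies that

$$\frac{1}{\sqrt{n}}\, z(\hat\theta_0) \approx N(0, \mathcal{I}_0).$$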


Now the Lagrange multipliers $\lambda$ in (4.51) are given by $\lambda = \partial l/\partial\theta_2$ under the
restriction that $\partial l/\partial\theta_1 = 0$. If we decompose the matrix $\mathcal{I}_0$ in (4.52) in
accordance with the components $z_1 = \partial l/\partial\theta_1$ and $z_2 = \partial l/\partial\theta_2$ of $z$ and use the
result (1.22) on conditional distributions of the normal distribution, it follows
that

$$\frac{1}{\sqrt{n}}\, z_2 \,\Big|\, z_1 = 0 \;\approx\; N\left(0,\; \mathcal{I}_{0,22} - \mathcal{I}_{0,21}\, \mathcal{I}_{0,11}^{-1}\, \mathcal{I}_{0,12}\right).$$

If we denote the above covariance matrix by $W$, it follows that $\lambda \approx N(0, V)$,
where $V = nW \approx \mathcal{I}_{22} - \mathcal{I}_{21}\mathcal{I}_{11}^{-1}\mathcal{I}_{12}$ is defined in terms of $\mathcal{I}_n$ in (4.36) with
decomposition according to that of $z$ in $z_1$ and $z_2$. Therefore $\lambda' V^{-1} \lambda \approx \chi^2(g)$. As the
matrix $V^{-1}$ is equal to the lower diagonal block of the matrix $\mathcal{I}_n^{-1}$, it follows from
(4.51) that


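$$LM = \lambda' V^{-1} \lambda = \begin{pmatrix} 0 \\ \lambda \end{pmatrix}' \mathcal{I}_n^{-1} \begin{pmatrix} 0 \\ \lambda \end{pmatrix}.$$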


LM-test  in  terms  of  the  log-likelihood 

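The above result can be written as

$$LM = \left(\frac{\partial l}{\partial\theta}\right)' \left(-E\left[\frac{\partial^2 l}{\partial\theta\, \partial\theta'}\right]\right)^{-1} \left(\frac{\partial l}{\partial\theta}\right), \qquad (4.54)$$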


where the expressions $\partial l/\partial\theta$ and $E[\partial^2 l/\partial\theta\, \partial\theta']$ are both evaluated at $\theta = \hat\theta_0$,
the ML estimate under the null hypothesis. The advantage of the LM-test is
that only the restricted ML estimate $\hat\theta_0$ has to be computed. This estimate
is then substituted in (4.54) in the gradient and the information matrix of
the unrestricted model. So we need to compute the gradient and Hessian
matrix of the unrestricted model, but we do not need to optimize the
unrestricted likelihood function. Therefore the LM-test is attractive if the
unrestricted likelihood function is relatively complicated.
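As a small worked illustration of (4.54), which is not taken from the text, consider the hypothetical case of n independent observations $y_i \sim N(\mu, 1)$ with known unit variance and null hypothesis $\mu = 0$. The symbols $\mu$ and $\bar y$ (the sample mean) are introduced only for this sketch.

```latex
l(\mu) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\sum_{i=1}^{n}(y_i - \mu)^2,
\qquad
\frac{\partial l}{\partial\mu}\Big|_{\mu=0} = \sum_{i=1}^{n} y_i = n\bar y,
\qquad
-E\left[\frac{\partial^2 l}{\partial\mu^2}\right] = n,

LM = (n\bar y)\, n^{-1}\, (n\bar y) = n\bar y^{\,2}.
```

Only the restricted value $\mu = 0$ enters the computation. Under the null hypothesis $\sqrt{n}\,\bar y \sim N(0, 1)$, so that in this simple case $LM \sim \chi^2(1)$ holds exactly and not only asymptotically.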


4.3.7  LM-test  in  the  linear  model 
Model  formulation 

As an illustration we apply the LM-test (4.54) for the linear model
$y = X\beta + \varepsilon$ with Assumptions 1-7. The vector of parameters $\beta$ is split in
two parts $\beta = \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}$, where $\beta_2$ is a $g \times 1$ vector and $\beta_1$ is a $(k - g) \times 1$
vector. The model can be written as $y = X_1\beta_1 + X_2\beta_2 + \varepsilon$ and we consider
the null hypothesis that $\beta_2 = 0$.

Derivation of LM-test with auxiliary regressions

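To compute the LM-test (4.54) for this hypothesis, we first note that under the null
hypothesis the model is given by $y = X_1\beta_1 + \varepsilon$. According to the results in Section
4.3.2, the ML estimates of this model are given by $\hat\beta_1 = b_R = (X_1'X_1)^{-1}X_1'y$ and
$s_R^2 = \frac{1}{n} e_R'e_R$, where $e_R = y - X_1 b_R$ are the restricted least squares residuals. The
gradient $\partial l/\partial\theta$ in (4.54), evaluated at $(\beta_1, \beta_2, \sigma^2) = (b_R, 0, s_R^2)$, is given by (4.31) and
(4.32), so that

$$\frac{\partial l}{\partial\beta_1} = \frac{1}{s_R^2}\, X_1'(y - X_1 b_R) = 0,$$

$$\frac{\partial l}{\partial\beta_2} = \frac{1}{s_R^2}\, X_2'(y - X_1 b_R) = \frac{1}{s_R^2}\, X_2' e_R,$$

$$\frac{\partial l}{\partial\sigma^2} = -\frac{n}{2 s_R^2} + \frac{1}{2 s_R^4}\, (y - X_1 b_R)'(y - X_1 b_R) = 0.$$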


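To compute the information matrix in (4.54), evaluated at $(b_R, 0, s_R^2)$, we use the
second order derivatives in (4.37), (4.38), and (4.39). The term in (4.37) becomes
$-\frac{1}{s_R^2} X'X$. As $s_R^2$ is a consistent estimator of $\sigma^2$ (under the null hypothesis),
the expectation of this term is approximately also equal to $-\frac{1}{s_R^2} X'X$. The term
in (4.39) becomes $\frac{n}{2 s_R^4} - \frac{1}{s_R^6}\, e_R'e_R = -\frac{n}{2 s_R^4}$, and the expectation of this term is
approximately the same. Finally, to evaluate (4.38), note that in Section 3.3.1
(p. 152-3) we proved that the least squares estimators $b$ and $s^2$ are independent.
Therefore, if the null hypothesis holds true, the restricted least squares estimators
$b_R$ and $s_R^2$ are also independent. The term in (4.38) is given by $-\frac{1}{s_R^4} X'(y - X_1 b_R)$, and

$$E\left[\frac{1}{s_R^4}\, X'(y - X_1 b_R)\right] = E\left[\frac{1}{s_R^4}\right] X'\, E[y - X_1 b_R] = 0,$$

as $E[y - X_1 b_R] = E[y] - X_1 E[b_R] = X_1\beta_1 - X_1\beta_1 = 0$ for $\beta_2 = 0$. Combining the
above results, we get

$$-E\left[\frac{\partial^2 l}{\partial\theta\, \partial\theta'}\right] \approx
\begin{pmatrix}
\frac{1}{s_R^2} X_1'X_1 & \frac{1}{s_R^2} X_1'X_2 & 0 \\
\frac{1}{s_R^2} X_2'X_1 & \frac{1}{s_R^2} X_2'X_2 & 0 \\
0 & 0 & \frac{n}{2 s_R^4}
\end{pmatrix}
=
\begin{pmatrix}
\frac{1}{s_R^2} X'X & 0 \\
0 & \frac{n}{2 s_R^4}
\end{pmatrix}.$$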


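With the above expressions for the gradient and the Hessian matrix, the LM-test
(4.54) becomes

$$LM = \begin{pmatrix} 0 \\ \frac{1}{s_R^2} X_2'e_R \\ 0 \end{pmatrix}'
\begin{pmatrix} \frac{1}{s_R^2} X'X & 0 \\ 0 & \frac{n}{2 s_R^4} \end{pmatrix}^{-1}
\begin{pmatrix} 0 \\ \frac{1}{s_R^2} X_2'e_R \\ 0 \end{pmatrix}
= \frac{1}{s_R^2} \begin{pmatrix} 0 \\ X_2'e_R \end{pmatrix}' (X'X)^{-1} \begin{pmatrix} 0 \\ X_2'e_R \end{pmatrix}$$

$$\phantom{LM} = \frac{e_R'X(X'X)^{-1}X'e_R}{s_R^2} = n\, \frac{e_R'X(X'X)^{-1}X'e_R}{e_R'e_R} = nR^2. \qquad (4.55)$$

Computation of LM-test as variable addition test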

This  is  precisely  the  result  (4.21)  that  was  obtained  in  Section  4.2.4  within 
the  setting  of  non-linear  regression  models.  This  result  holds  true  much  more 
generally  —  that  is,  in  many  cases  (non-linear  models,  non-linear  restrictions, 
non-normal  disturbances)  the  LM-test  can  be  computed  as  follows. 


Computation  of  LM-test  by  auxiliary  regressions 

•  Step  1:  Estimate  the  restricted  model.  Estimate  the  restricted  model,  with 
corresponding residuals $e_R$.

•  Step 2: Auxiliary regression of residuals of step 1. Perform a regression of
$e_R$ on all the variables in the unrestricted model. In non-linear models
$y = f(x, \beta) + \varepsilon$, the regressors are given by $\partial f/\partial\beta$; in other types of models the
regressors may be of a different nature (several examples will follow in the
next chapters).

•  Step 3: LM = nR² of step 2. Then $LM = nR^2 \approx \chi^2(g)$, where $R^2$ is the $R^2$ of
the regression in step 2.
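The three steps can be sketched in a few lines of Python for the linear model. This is only an illustration under stated assumptions: the function names `ols_residuals` and `lm_variable_addition` and the toy data are hypothetical, the restricted model is assumed to contain a constant term (so that the residuals of step 1 have mean zero), and only the standard library is used.

```python
def ols_residuals(X, y):
    """Least squares residuals of y on the columns of X (X is a list of rows)."""
    k = len(X[0])
    # Augmented normal equations [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    b = [A[i][k] / A[i][i] for i in range(k)]
    return [yi - sum(bj * xj for bj, xj in zip(b, row)) for row, yi in zip(X, y)]


def lm_variable_addition(X1, X2, y):
    """LM = n * R^2 of the regression of restricted residuals on all regressors."""
    n = len(y)
    e_r = ols_residuals(X1, y)                  # step 1: restricted model
    X = [r1 + r2 for r1, r2 in zip(X1, X2)]     # regressors of the unrestricted model
    u = ols_residuals(X, e_r)                   # step 2: auxiliary regression
    sst = sum(v * v for v in e_r)               # e_r has mean zero (constant in X1)
    r2 = 1.0 - sum(v * v for v in u) / sst
    return n * r2                               # step 3


# Hypothetical toy data, invented for illustration only.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [v * v for v in x1]
y = [1.2, 1.9, 3.4, 3.8, 5.9, 7.1, 9.2, 11.8]
X1 = [[1.0, a] for a in x1]                     # constant and x1 (restricted model)
X2 = [[b] for b in x2]                          # added variable (g = 1)
lm = lm_variable_addition(X1, X2, y)
print(f"LM = {lm:.3f}")
```

The value of `lm` would be compared with a critical value of the $\chi^2(1)$ distribution, as g = 1 variable is added in this sketch.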


Because  variables  are  added  in  step  2  to  the  variables  that  are  used  in  step  1, 
this  is  also  called  a  variable  addition  test.  The  precise  nature  of  the  variables 
to  be  used  in  the  regression  in  step  2  depends  on  the  particular  testing 
problem at hand. In the rest of this book we will encounter several examples.

Exercises: E: 4.14d, 4.15b.


4.3.8  Remarks  on  tests 

Uses  Section  1.4.1. 


Comparison  of  three  tests 

In  the  foregoing  sections  we  discussed  four  tests  on  parameter  restrictions  (F, 
LR,  W,  and  LM).  In  this  section  we  give  a  brief  summary  and  we  comment 
on  some  computational  issues. 

Exhibit 4.16 (a) gives a graphical illustration of the relation between the
LR-, W-, and LM-tests for the case of a single parameter $\theta$ with the null
hypothesis that $\theta = 0$. The W- and LM-tests are an approximation of the
LR-test, that is, the loss in log-likelihood caused by imposing the null
hypothesis. The advantage of the LM- and W-tests is that only one model needs to
be estimated. If the restricted model is the simplest to estimate, as is often the
case, then the LM-test is preferred. In situations where the unrestricted
model is the simplest to estimate we can use the W-test. Exhibit 4.16 (b)
gives a summary comparison of the three tests LR, W, and LM.

(a) [Figure: log-likelihood log L plotted against θ, indicating the LR-, W-, and LM-tests]

(b)
Test              LR                         W                      LM
Estimated models  2 (under H0 and H1)        1 (under H1)           1 (under H0)
Advantage         Optimal power              If model under H0      Simple computations
                                             is complicated         (auxiliary regressions)
Disadvantage      Needs 2 optimizations      Test depends on        Power may be small
                  (ML under H0 and H1)       parametrization
Main formula      (4.44)                     (4.48)                 (4.54) and (4.55)
                  2 log L(H1) - 2 log L(H0)  (generalizes F-test)   LM = nR²

Exhibit 4.16 Comparison of tests

(a) gives a graphical illustration of the Likelihood Ratio test, the Wald test, and the Lagrange
Multiplier test. The LR-test is based on the indicated vertical distance, the W-test on the
indicated horizontal distance, and the LM-test on the indicated gradient. (b) contains a
summary comparison of the three tests.

In  Section  4.2.3  we  discussed  methods  for  non-linear  optimization.  In 
general  this  involves  a  number  of  iterations  to  improve  the  estimates  and  a 
stopping  rule  determines  when  the  iterations  are  ended.  In  ML  estimation 
one can stop the iterations if the criterion values of the log-likelihood do not
change anymore (this is related to the LR-test), if the estimates do not change
anymore (this is related to the W-test, which weighs the changes against the
variance of the estimates), or if the gradient has become zero (this is related to
the LM-test, which weighs the gradient against its variance).

Relations between tests

The relations of the tests LR, W, and LM with the F-test for a linear
hypothesis in a linear model are given in (4.46), (4.49), and (4.23). From
these expressions the following inequalities can be derived for testing a linear
hypothesis in a linear model.

$$LM \le LR \le W. \qquad (4.56)$$

This is left as an exercise (see Exercise 4.6). As all three statistics have the
same asymptotic $\chi^2(g)$ distribution, it follows that the P-values based on this
distribution satisfy $P(LM) \ge P(LR) \ge P(W)$. This means that, if the LM-test
rejects the null hypothesis, the same holds true for the LR- and W-tests, and,
if the W-test fails to reject the null hypothesis, then the same holds true for
the LM- and LR-tests. It also follows from (4.23), (4.46), and (4.49) that the
three tests are asymptotically (for $n \to \infty$) equivalent to $gF(g, n - k)$, and this
converges in distribution to a $\chi^2(g)$ distribution. That is, all four tests are
asymptotically equivalent.
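The ordering in (4.56) can be checked numerically. The sketch below assumes the standard expressions of the three statistics in terms of the restricted and unrestricted sums of squared residuals for a linear hypothesis in a linear model with normal disturbances; it is an assumption of this sketch that these agree with (4.23), (4.46), and (4.49). The function name `three_tests` and the numbers are hypothetical.

```python
import math


def three_tests(ssr_restricted, ssr_unrestricted, n):
    """Return (LM, LR, W) for a linear hypothesis in a linear model."""
    # Relative loss of fit caused by the restrictions (non-negative).
    x = (ssr_restricted - ssr_unrestricted) / ssr_unrestricted
    w = n * x                       # Wald
    lr = n * math.log(1.0 + x)      # Likelihood Ratio
    lm = n * x / (1.0 + x)          # Lagrange Multiplier (= n R^2)
    return lm, lr, w


# Hypothetical values, for illustration only.
lm, lr, w = three_tests(ssr_restricted=120.0, ssr_unrestricted=100.0, n=50)
print(f"LM = {lm:.3f}, LR = {lr:.3f}, W = {w:.3f}")
```

With $x \ge 0$ the relative increase in the sum of squared residuals, the ordering follows from $x/(1 + x) \le \log(1 + x) \le x$.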

$\chi^2$- and F-distribution in testing

To perform the tests LR, W, and LM, it is sometimes preferable to use the
critical values of the $gF(g, n - k)$ distribution instead of those of the $\chi^2(g)$
distribution. These critical values are somewhat larger, so that the evidence
to reject the null hypothesis should be somewhat stronger than what would
be required asymptotically. Exhibit 4.17 shows the 5 per cent critical values
for some selected degrees of freedom $(g, n - k)$. This shows that both
methods lead to the same results for large sample sizes, but that for small
samples the critical values of the $gF(g, n - k)$ distribution may be considerably
larger than those of the $\chi^2(g)$ distribution.

   g    gF(g,10)   gF(g,100)   gF(g,1000)    χ²(g)
   1      4.96        3.94        3.85        3.84
   2      8.21        6.17        6.01        5.99
   3     11.12        8.09        7.84        7.81
   4     13.91        9.85        9.52        9.49
   5     16.63       11.53       11.12       11.07
   6     19.30       13.14       12.65       12.59
   7     21.95       14.72       14.13       14.07
   8     24.57       16.26       15.58       15.51
   9     27.18       17.77       17.00       16.92
  10     29.78       19.27       18.40       18.31
  20     55.48       33.53       31.62       31.41
  50    131.86       73.86       68.16       67.50
 100    258.84      139.17      125.96      124.34

Exhibit 4.17 F- and χ²-distributions

The 5% critical values of the chi-squared distribution (last column) for some selected degrees
of freedom (g) and the 5% critical values of the scaled F-distribution gF(g, n - k) for different
values of n - k (10, 100, and 1000).


Alternative  expressions  for  tests  and  information  matrix 

Sometimes the W-test and the LM-test are computed by expressions that differ from
(4.48) and (4.54) by using approximations of the information matrix. Note that
the values $\theta = \theta_0$ (of the DGP), $\theta = \hat\theta_1$ (ML in the unrestricted model), and
$\theta = \hat\theta_0$ (ML under the null hypothesis) are all asymptotically equal (if the null
hypothesis is true), as ML estimators are consistent so that $\mathrm{plim}(\hat\theta_1) = \theta_0$ and $\mathrm{plim}(\hat\theta_0) = \theta_0$.
For instance, for independent observations the log-likelihood is given by (4.27) and
the information matrix in (4.48) and (4.54) can be approximated by using


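$$\mathcal{I}_n = -E\left[\frac{\partial^2 l}{\partial\theta\, \partial\theta'}\right] \approx -\sum_{i=1}^{n} \frac{\partial^2 l_i}{\partial\theta\, \partial\theta'} \approx \sum_{i=1}^{n} \frac{\partial l_i}{\partial\theta}\, \frac{\partial l_i}{\partial\theta'},$$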

evaluated at any of the three parameter values $\theta_0$, $\hat\theta_1$, or $\hat\theta_0$. The last approximation
was stated in (4.28) and may be convenient as it requires only the first order
derivatives, but it may provide less precise estimates as compared to methods that
make use of the second order derivatives. All these expressions can also be used as
approximations of the asymptotic covariance matrix of the ML estimator in (4.35).

Exercises: T: 4.5, 4.6b; E: 4.13, 4.14, 4.15, 4.16.


4.3.9  Two  examples 

We  illustrate  ML  estimation  and  testing  with  two  examples.  The  first  example 
concerns  a  linear  model  with  non-normal  disturbances,  the  second  example  a 
non-linear  regression  model  with  normally  distributed  disturbances.  Of  course 
it is also of interest to apply the LR-, W-, and LM-tests for a linear hypothesis
in a linear model and to compare the outcomes with the F-test of Chapter 3.
This  is  left  for  the  exercises  (see  Exercises  4.14  and  4.15).  Further,  ML  has 
important  applications  for  other  types  of  models  that  cannot  be  expressed  as  a 
regression.  Such  applications  will  be  discussed  in  Chapters  6  and  7. 

Example  4.5:  Stock  Market  Returns  (continued) 

We  consider  again  the  CAPM  for  the  sector  of  cyclical  consumer  goods 
of  Example  4.4  in  Section  4.3.1.  We  will  discuss  (i)  the  specification  of 
the log-likelihood for t(5)-distributed disturbances, (ii) outcomes of the ML
estimates, and (iii) choice of the number of degrees of freedom in the
t-distribution.

(i) Log-likelihood for (scaled) t(5)-distributed disturbances

As was discussed in Example 4.4 (p. 223-4), the disturbance terms in
the CAPM may have fatter tails than the normal distribution (see also
Exhibit 4.11 (a)). As an alternative, we consider the same linear model
with disturbances that have the (scaled) t(5)-distribution. That is, the
model is given by

$$y_i = \alpha + \beta x_i + \varepsilon_i,$$

where $y_i$ and $x_i$ are the excess returns in respectively the sector of cyclical
consumer goods and the whole market. We suppose that Assumptions 1-6 of
Section 3.1.4 (p. 125) are satisfied. In particular, the disturbance terms $\varepsilon_i$ have
zero mean, they have equal variance, and we assume that they are mutually
independent. As independence implies being uncorrelated, this is stronger
than Assumption 4 of uncorrelated disturbance terms. The postulated scaled
t(5)-density of the disturbance terms is

$$p(\varepsilon_i) = c_5 \left(1 + \frac{\varepsilon_i^2}{5\sigma^2}\right)^{-3} \frac{1}{\sigma},$$

where $c_5$ is a scaling constant (that does not depend on $\sigma$) so that
$\int p(\varepsilon_i)\, d\varepsilon_i = 1$. The log-likelihood (4.27) is given by

$$l(\alpha, \beta, \sigma^2) = \sum_{i=1}^{n} \log(p(\varepsilon_i)) = n \log(c_5) - \frac{n}{2} \log(\sigma^2) - 3 \sum_{i=1}^{n} \log\left(1 + \frac{(y_i - \alpha - \beta x_i)^2}{5\sigma^2}\right).$$

(ii)  ML  estimates  based  on  (scaled)  t(5)-distribution 

The first order derivatives of the above log-likelihood are given by

$$\frac{\partial l}{\partial\alpha} = \sum_{i=1}^{n} \frac{6\varepsilon_i}{5\sigma^2 + \varepsilon_i^2}, \qquad
\frac{\partial l}{\partial\beta} = \sum_{i=1}^{n} \frac{6\varepsilon_i x_i}{5\sigma^2 + \varepsilon_i^2}, \qquad
\frac{\partial l}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{3}{\sigma^2} \sum_{i=1}^{n} \frac{\varepsilon_i^2}{5\sigma^2 + \varepsilon_i^2}.$$

Substituting $\varepsilon_i = y_i - \alpha - \beta x_i$, the ML estimates are obtained by solving the
above three non-linear equations $\partial l/\partial\alpha = \partial l/\partial\beta = \partial l/\partial\sigma^2 = 0$. The outcomes
$(a_{ML}, b_{ML}, s_{ML})$ of the BHHH algorithm of Section 4.3.2 are given
in Exhibit 4.18 (a-c), together with a histogram of the ML residuals
$e_i = y_i - a_{ML} - b_{ML} x_i$ in (f). The iterations are started in $(\alpha, \beta, \sigma) = (0, 1, 1)$
and converge to $(a_{ML}, b_{ML}, s_{ML}) = (-0.34, 1.20, 4.49)$.

Exhibit 4.18 Stock Market Returns (Example 4.5)

ML estimates of CAPM for the sector of cyclical consumer goods in the UK, using a scaled t(5)
distribution for the disturbances. (a)-(c) show the estimates of the constant term α (a), the slope
β (b), and the scale parameter σ (c) obtained by twenty iterations of the BHHH algorithm, with
starting values α = 0, β = 1, and σ = 1. (d)-(e) show the values of SSR (d) and of the
log-likelihood values (denoted by LL (e)) obtained in these iterations. The value of LL increases at
each iteration, but the value of SSR does not always decrease. (f) shows the histogram of the ML
residuals, and (g) shows the maximum of the log-likelihood function for the t(d) distribution for
different degrees of freedom (the optimal value is obtained for d = 8, and for d infinitely large
(the case of the normal distribution) the LL value is indicated by the horizontal line).

(iii) Choice of degrees of freedom of t-distribution

The motivation to use the (scaled) t(5)-distribution instead of the normal
distribution is that the disturbance distribution may have fat tails. However,
we have no special reason to take the t-distribution with d = 5 degrees of
freedom. Therefore we estimate the CAPM also with scaled t(d)-distributions
for selected values of d, including $d = \infty$ (which corresponds to the normal
distribution, see Section 1.2.3 (p. 33)). This can be used for a grid search for d
to obtain ML estimates of the parameters $(\alpha, \beta, \sigma^2, d)$. Exhibit 4.18 (g)
shows the maximum of the log-likelihood for different values of d. The
overall optimum is obtained for d = 8. The difference in the log-likelihood
with d = 5 is rather small. We can also test for the null hypothesis of normally
distributed error terms against the alternative of a t(d)-distribution, that is,
the test of $d = \infty$ against $d < \infty$. The Likelihood Ratio test is given by

$$LR = 2l(\hat\theta_1) - 2l(\hat\theta_0) = 2(-747.16 + 750.54) = 6.77,$$

where $l(\hat\theta_1) = -747.16$ is the unrestricted maximal log-likelihood value (that
is, for d = 8) and $l(\hat\theta_0) = -750.54$ is the log-likelihood value for the model
with normally distributed disturbances. Asymptotically, LR follows the $\chi^2(1)$
distribution. The P-value of the computed LR-test is P = 0.009, so that the
null hypothesis is rejected. Therefore we conclude that, under the stated
assumptions, a t-distribution may be more convenient to model the disturbances
of the CAPM than a normal distribution. In Section 4.4.6 we provide a
further comparison between the models with normal and with scaled
t(5)-disturbances.
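The arithmetic of this LR-test can be reproduced with the standard library. The sketch below uses the fact that a $\chi^2(1)$ variable is the square of a standard normal variable, so that its upper tail probability equals `erfc(sqrt(x / 2))`; the helper name `chi2_1_pvalue` is hypothetical. With the rounded log-likelihood values quoted above the statistic comes out as 6.76 rather than the reported 6.77, presumably because the reported value is based on unrounded log-likelihoods.

```python
import math


def chi2_1_pvalue(stat):
    """P(chi2(1) > stat), using chi2(1) = Z^2 with Z standard normal."""
    return math.erfc(math.sqrt(stat / 2.0))


ll_t = -747.16        # maximal log-likelihood of the t(d) model, d = 8 (from the text)
ll_normal = -750.54   # log-likelihood of the model with normal disturbances (from the text)
lr = 2.0 * (ll_t - ll_normal)
p = chi2_1_pvalue(lr)
print(f"LR = {lr:.2f}, P = {p:.3f}")
```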
Example 4.6: Coffee Sales (continued)

As  a  second  example  we  consider  again  the  data  on  sales  of  two  brands  of 
coffee  discussed  before  in  Section  4.2.5.  We  will  discuss  (i)  the  outcomes  of 
ML estimation for the two brands, (ii) LR-tests on constant elasticity, (iii)
LM-tests  on  constant  elasticity,  (iv)  Wald  tests  on  constant  elasticity,  and  (v) 
comparison  of  the  tests  and  conclusion. 


(i)  Outcomes  of  ML  for  the  two  brands 

For each of the two brands separately, we use the non-linear regression model
(4.9) with the assumption of normally distributed disturbances, so that

$$\log(q_i) = \beta_1 + \frac{\beta_2}{\beta_3}\left(d_i^{\beta_3} - 1\right) + \varepsilon_i, \qquad \varepsilon_i \sim \mathrm{NID}(0, \sigma^2).$$

The null hypothesis of constant demand elasticity corresponds to the parameter
restriction $\beta_3 = 0$, with corresponding model $\log(q_i) = \beta_1 + \beta_2 \log(d_i) + \varepsilon_i$.
We perform different tests of this hypothesis for both brands of coffee. The
tests are based on the results in Exhibit 4.19. Panels 1 and 3 give the results of
ML estimation under the hypothesis that $\beta_3 = 0$. Because the disturbances
are assumed to be normally distributed, ML corresponds to least squares (see
Section 4.3.2). Panels 2 and 4 give the results of ML estimation in the
unrestricted non-linear regression model (4.9). This corresponds to
non-linear least squares.

(ii) LR-tests on constant elasticity

The Likelihood Ratio tests for the null hypothesis that $\beta_3 = 0$ against the
alternative that $\beta_3 \ne 0$ can be obtained from the results in Exhibit 4.19 for
brands 1 and 2. The results are as follows, with P-values based on the
asymptotic $\chi^2(1)$ distribution:

$$LR_1 = 2(12.530 - 10.017) = 5.026 \quad (P = 0.025),$$

$$LR_2 = 2(11.641 - 10.024) = 3.235 \quad (P = 0.072).$$

(iii)  LM-tests  on  constant  elasticity 

Under the null hypothesis, the model is linear with dependent variable $\log(q_i)$
and with explanatory variable $\log(d_i)$ (see Example 4.2 (p. 202-4)).
The Lagrange Multiplier test for non-linear regression models has already
been performed in Section 4.2.5 for both brands of coffee, with the results in
Panels 5 and 6 in Exhibit 4.10. The test outcomes are

$$LM_1 = nR^2 = 12 \cdot 0.342 = 4.106 \quad (P = 0.043),$$

$$LM_2 = nR^2 = 12 \cdot 0.236 = 2.836 \quad (P = 0.092).$$


(iv)  Wald  tests  on  constant  elasticity 

To compute the Wald test (4.48) we use the relation (4.50) between the Wald
test and the t-test, that is,

$$W = \frac{n}{n - k}\, t^2.$$

The non-linear regressions in Panels 2 and 4 in Exhibit 4.10 show the t-values
of $\beta_3$ with P-values based on the t(9)-distribution, namely,

$$t_1 = -2.012 \quad (P = 0.075),$$
$$t_2 = -1.651 \quad (P = 0.133).$$

Using (4.50) with n = 12 and k = 3, this leads to the following values for the
Wald test, with corresponding P-values based on the $\chi^2(1)$ distribution:

$$W_1 = \frac{12}{9} \cdot (-2.012)^2 = 5.398 \quad (P = 0.020),$$

$$W_2 = \frac{12}{9} \cdot (-1.651)^2 = 3.633 \quad (P = 0.057).$$

Panel 1: Dependent Variable: LOGQ1 (brand 1)
Method: Least Squares
Included observations: 12

Variable    Coefficient   Std. Error   t-Statistic   Prob.
C             5.841739     0.043072     135.6284     0.0000
LOGD1         4.664693     0.581918     8.016063     0.0000

R-squared            0.865333
Sum squared resid    0.132328
Log likelihood       10.01699

Panel 2: Dependent Variable: LOGQ1 (brand 1)
Method: Least Squares
Included observations: 12
Convergence achieved after 5 iterations
LOGQ1 = C(1) + (C(2)/C(3)) * (D1^C(3) - 1)

Parameter   Coefficient   Std. Error   t-Statistic   Prob.
C(1)          5.807118     0.040150     144.6360     0.0000
C(2)         10.29832      3.295386     3.125072     0.0122
C(3)        -13.43073      6.674812    -2.012152     0.0751

R-squared            0.911413
Sum squared resid    0.087049
Log likelihood       12.52991

Panel 3: Dependent Variable: LOGQ2 (brand 2)
Method: Least Squares
Included observations: 12

Variable    Coefficient   Std. Error   t-Statistic   Prob.
C             4.406561     0.043048     102.3638     0.0000
LOGD2         6.003298     0.581599     10.32206     0.0000

R-squared            0.914196
Sum squared resid    0.132183
Log likelihood       10.02358

Panel 4: Dependent Variable: LOGQ2 (brand 2)
Method: Least Squares
Included observations: 12
Convergence achieved after 5 iterations
LOGQ2 = C(1) + (C(2)/C(3)) * (D2^C(3) - 1)

Parameter   Coefficient   Std. Error   t-Statistic   Prob.
C(1)          4.377804     0.043236     101.2540     0.0000
C(2)         10.28864      3.001698     3.427608     0.0075
C(3)         -8.595289     5.207206    -1.650653     0.1332

R-squared            0.934474
Sum squared resid    0.100944
Log likelihood       11.64129

Exhibit 4.19 Coffee Sales (Example 4.6)

Regressions for two brands of coffee, models with constant elasticity (Panels 1 and 3) and
models with varying elasticity (Panels 2 and 4).

(v)  Comparison  of  tests  and  conclusion 

Summarizing the outcomes of the test statistics, note that for both brands of
coffee LM < LR < W, in accordance with (4.56). If we use a 5 per cent
significance level, the null hypothesis of constant demand elasticity is not
rejected for brand 2, but it is rejected for brand 1 by the LR-test, the LM-test,
and the W-test, but not by the t-test.

As the sample size (n = 12) is very small, the asymptotic $\chi^2(1)$ distribution
is only a rough approximation. It is helpful to consider also the
$gF(g, n - k) = F(1, 9)$ distribution with 5 per cent critical value equal to
5.12. This is considerably larger than the value 3.84 for the $\chi^2(1)$
distribution. With this critical value of 5.12, all tests fail to reject the null hypothesis,
with the exception of the Wald test for brand 1. Therefore, on the basis of
these data there is no compelling evidence to reject the null
hypothesis of constant elasticity of the demand for coffee. Of course, the
number of observations for the two models (n = 12 for both brands) is very
small, and in Section 5.3.1 (p. 307-10) we will use a combined model for the
two brands (so that n = 24 in this case). As we shall see in Section 5.3.1,
the null hypothesis of constant elasticity can then be rejected for both brands
by all three tests (LR, LM, and Wald).
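The test outcomes of this example can be recomputed from the numbers reported in Exhibit 4.19 and in parts (iii) and (iv). The sketch below is only an illustration: the dictionary layout and the helper name `chi2_1_pvalue` are hypothetical, and the LM values differ slightly from those in the text (4.104 and 2.832 instead of 4.106 and 2.836) because the auxiliary $R^2$ values are used here in their rounded form.

```python
import math


def chi2_1_pvalue(stat):
    """P(chi2(1) > stat), using chi2(1) = Z^2 with Z standard normal."""
    return math.erfc(math.sqrt(stat / 2.0))


n, k = 12, 3
results = {
    "brand 1": {"ll_u": 12.52991, "ll_r": 10.01699, "r2_aux": 0.342, "t": -2.012152},
    "brand 2": {"ll_u": 11.64129, "ll_r": 10.02358, "r2_aux": 0.236, "t": -1.650653},
}
CRIT_CHI2 = 3.84   # 5% critical value of chi2(1), as quoted in the text
CRIT_F = 5.12      # 5% critical value of F(1, 9), as quoted in the text

tests = {}
for brand, r in results.items():
    lr = 2.0 * (r["ll_u"] - r["ll_r"])   # LR from the two log-likelihoods
    lm = n * r["r2_aux"]                 # LM = n R^2 of the auxiliary regression
    w = n / (n - k) * r["t"] ** 2        # W from the t-value, relation (4.50)
    tests[brand] = {"LM": lm, "LR": lr, "W": w}
    for name, stat in tests[brand].items():
        print(f"{brand} {name}: {stat:.3f}  P={chi2_1_pvalue(stat):.3f}  "
              f"reject(chi2)={stat > CRIT_CHI2}  reject(F)={stat > CRIT_F}")
```

The printed rejection flags reproduce the conclusions above: with the $\chi^2(1)$ critical value all three tests reject for brand 1 and none for brand 2, whereas with the F(1, 9) critical value only the Wald test for brand 1 rejects.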
4.4 Generalized method of moments


4.4.1  Motivation 

Requirements  for  maximum  likelihood 

The  results  in  the  foregoing  section  show  that  maximum  likelihood  has 
(asymptotically)  optimal  properties  for  correctly  specified  models.  In  practice 
this  means  that  the  joint  probability  distribution  (4.25)  of  the  data  should  be 
a  reasonable  reflection  of  the  data  generating  process.  If  there  is  much 
uncertainty  about  this  distribution,  then  it  may  be  preferable  to  use  an 
estimation method that requires somewhat less information on the DGP. In
general, by making fewer assumptions on the DGP, some efficiency will be lost
as compared to ML in the correct model. However, this loss may be relatively
small  compared  to  the  loss  of  using  ML  in  a  model  that  differs  much  from 
the  DGP. 

Evaluation  of  accuracy  of  estimates 

The  accuracy  of  parameter  estimates  is  usually  evaluated  in  terms  of  their 
standard  errors  and  their  P-values  associated  with  tests  of  significance.  Until 
now  we  have  discussed  two  methods  for  this  purpose.  The  expression 

$$\mathrm{var}(b) = s^2 (X'X)^{-1}$$

provides  correct  P-values  on  the  significance  of  least  squares  estimates  if  the 
seven  standard  Assumptions  1-7  of  Section  3.1.4  are  satisfied.  Further,  the 
expression 


$$\mathrm{var}(\hat\theta_{ML}) = \mathcal{I}_n^{-1}(\hat\theta_{ML})$$

provides asymptotically correct P-values on the significance of maximum
likelihood estimates if the joint probability function $p(y, X, \theta)$ of the data is
correctly specified (see Section 4.3.3).

In this section we discuss the generalized method of moments (GMM). In
this approach the parameters are estimated by solving a set of moment
conditions. As we shall see below, both OLS and ML can be seen as particular
examples of estimators based on moment conditions. The GMM standard
errors are computed on the basis of the moment conditions and they provide
asymptotically correct P-values, provided that the specified moment conditions
are valid. For instance, one can estimate the parameters by OLS and
compute the GMM standard errors even if not all the Assumptions 1-7
hold true. One can also estimate the parameters by ML and compute the
GMM standard errors, even if the specified probability distribution is not
correct. That is, GMM can be used to compute reliable standard errors and
P-values in situations where some of the assumptions of OLS or ML are
not satisfied.

Example  4.7:  Stock  Market  Returns  (continued) 

As an illustration, Exhibit 4.20 shows the OLS residuals of the CAPM
discussed in Examples 2.1, 4.4, and 4.5. It seems that the disturbances have
a larger variance at the beginning and near the end of the observation period
as compared to the middle period. If the variances differ, then the disturbances
are heteroskedastic and Assumption 3 is violated. We have already concluded
(see Sections 4.3.1 and 4.3.9) that Assumption 7 of normally distributed
disturbances is also doubtful. However, the alternative of ML based on the
t-distribution does not take the apparent heteroskedasticity of the disturbances
into account either. It seems preferable to evaluate the CAPM, in
particular to compute the standard errors, without making such assumptions.
In Section 4.4.6 we will use GMM for this purpose.

Exhibit 4.20 Stock Market Returns (Example 4.7)

Least squares residuals of CAPM for the sector of cyclical consumer goods in the UK.

4.4.2 GMM estimation

Uses  Section  1.3.1. 

Method  of  moments  estimator  of  the  mean 

In Section 1.3.1 (p. 39) we discussed the method of moments, which is based
on estimating population moments by means of sample moments. For
example, suppose that the data $y_i$ consist of a random sample from a
population with unknown mean $\mu$, so that

$$E[y_i - \mu] = 0.$$

Then the moment estimator of $\mu$ is obtained by replacing the population
mean (E) by the sample mean ($\frac{1}{n}\sum_{i=1}^{n}$), so that

$$\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat\mu) = 0,$$

that is, $\hat\mu = \frac{1}{n}\sum_{i=1}^{n} y_i$.

Least  squares  derived  by  the  method  of  moments 

The least squares estimator in the linear model (4.1) can also be derived by
the method of moments. The basic requirement for this estimator is the
orthogonality condition (4.4). Here $\frac{1}{n}X'\varepsilon = \frac{1}{n}\sum_{i=1}^{n} x_i\varepsilon_i$, where $x_i$ is the
$k \times 1$ vector of explanatory variables for the $i$th observation, and condition
(4.4) is satisfied (under weak regularity conditions) if $E[x_i\varepsilon_i] = 0$ for all $i$. As
$\varepsilon_i = y_i - x_i'\beta$, this is equivalent to the condition that

$$E[x_i(y_i - x_i'\beta)] = 0, \quad i = 1, \ldots, n. \qquad (4.58)$$

Note that $x_i$ is a $k \times 1$ vector, so that this imposes $k$ restrictions on the
parameter vector $\beta$. The corresponding conditions on the sample moments
(replacing the population mean $E$ by the sample mean $\frac{1}{n}\sum_{i=1}^{n}$) give the $k$
equations

$$\frac{1}{n}\sum_{i=1}^{n} x_i(y_i - x_i'\hat{\beta}) = 0.$$

This can be written as $X'(y - X\hat{\beta}) = 0$, so that $\hat{\beta} = b$ is equal to the least
squares estimator. This shows that OLS can be derived by the method of
moments, using the orthogonality conditions (4.58) as moment conditions.
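
The following Python fragment is a minimal sketch of this derivation for the special case $k = 2$ (a constant term and one regressor). It is not part of the original text: the function names and the simulated data are assumptions chosen for illustration. It solves the two sample moment conditions directly and confirms that the sample moments vanish at the solution.

```python
import random


def solve_moment_conditions(x, y):
    """Solve (1/n) * sum x_i (y_i - x_i' beta) = 0 for x_i = (1, x_i)'."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    # The two conditions are linear in (a, b), that is, X'X beta = X'y:
    #   n * a + sx * b = sy   and   sx * a + sxx * b = sxy.
    det = n * sxx - sx * sx
    a = (sxx * sy - sx * sxy) / det
    b = (n * sxy - sx * sy) / det
    return a, b


def sample_moments(x, y, a, b):
    """Return the two sample moments (1/n) * sum x_i (y_i - a - b * x_i)."""
    n = len(x)
    e = [yi - a - b * xi for xi, yi in zip(x, y)]
    return sum(e) / n, sum(xi * ei for xi, ei in zip(x, e)) / n


random.seed(0)  # simulated data, for illustration only
x = [random.gauss(0.0, 2.0) for _ in range(200)]
y = [1.0 + 0.5 * xi + random.gauss(0.0, 1.0) for xi in x]

a_hat, b_hat = solve_moment_conditions(x, y)
g1, g2 = sample_moments(x, y, a_hat, b_hat)
print(a_hat, b_hat)  # moment estimates of intercept and slope
print(g1, g2)        # both sample moments are zero up to rounding error
```

The estimates coincide with the usual least squares formulas, because the sample moment conditions are exactly the normal equations.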
ML as method of moments estimator

ML estimators can also be obtained from moment conditions. Suppose that
the data consist of $n$ independent observations, so that the log-likelihood is
$l(\theta) = \sum_{i=1}^{n} l_i(\theta)$, as in (4.27). By the arguments in (4.53), replacing
$\log p(\theta)$ by $l_i(\theta) = \log p_\theta(y_i, x_i)$, it follows that

$$E\left[\left.\frac{\partial l_i}{\partial \theta}\right|_{\theta = \theta_0}\right] = 0, \quad i = 1, \ldots, n. \qquad (4.59)$$

Replacing the population mean $E$ by the sample mean $\frac{1}{n}\sum_{i=1}^{n}$, this gives

$$\frac{1}{n}\sum_{i=1}^{n} \frac{\partial l_i}{\partial \theta} = 0. \qquad (4.60)$$

The solution of these equations gives the ML estimator, as this corresponds
to the first order conditions for a maximum of the log-likelihood. The
equations (4.60) require that the sample mean of the terms $\partial l_i / \partial \theta$ is equal to
zero. Such equations are called 'generalized' moment conditions.
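
As a small worked illustration (an assumed example, not taken from the text), suppose that the observations $y_i > 0$ are independent draws from the exponential density $p_\theta(y) = \theta e^{-\theta y}$, for which $E[y_i] = 1/\theta_0$. The moment conditions (4.59) and their sample counterpart (4.60) then read as follows.

```latex
\begin{aligned}
l_i(\theta) &= \log\theta - \theta y_i,
\qquad \frac{\partial l_i}{\partial\theta} = \frac{1}{\theta} - y_i, \\
E\left[\left.\frac{\partial l_i}{\partial\theta}\right|_{\theta=\theta_0}\right]
 &= \frac{1}{\theta_0} - E[y_i] = 0, \\
\frac{1}{n}\sum_{i=1}^{n}\left(\frac{1}{\hat\theta} - y_i\right) &= 0
\quad\Longrightarrow\quad
\hat\theta = \frac{1}{\bar{y}}, \qquad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i .
\end{aligned}
```

In this case the moment estimator obtained from the score equals the familiar ML estimator of the exponential rate parameter.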


The  generalized  method  of  moments 

We now describe the generalized method of moments in more general terms. The
basic assumption is that we can formulate a set of moment conditions. Suppose
that the parameter vector of interest, $\theta$, contains $p$ unknown parameters
and that the DGP has parameters $\theta_0$. Further suppose that for each observation
($i = 1, \ldots, n$) the DGP satisfies $m$ distinct moment conditions, say

$$E[g_i(\theta_0)] = 0, \quad i = 1, \ldots, n, \qquad (4.61)$$

where the $g_i$ are known functions $g_i: \mathbb{R}^p \to \mathbb{R}^m$ that depend on the observed
data. That is, the crucial assumption is that the DGP satisfies the $m$ restrictions
in (4.61) for the observations $i = 1, \ldots, n$. Examples are the orthogonality
conditions (4.58) (which correspond to $k$ linear functions in the $k$
unknown parameters) and the first order conditions (4.60) (which give $p$
non-linear functions in the $p$ unknown parameters). If the number of
moment conditions $m$ is equal to the number of unknown parameters $p$ in
$\theta$, then the model (4.61) is called exactly identified, and if $m > p$ then the
model is called over-identified. The GMM estimator $\hat{\theta}$ is defined as the
solution of the $m$ equations obtained by replacing the population mean $E$
in (4.61) by the sample mean $\frac{1}{n}\sum_{i=1}^{n}$, that is,

$$\frac{1}{n}\sum_{i=1}^{n} g_i(\hat{\theta}) = 0. \qquad (4.62)$$
Numerical aspects of GMM

To obtain a solution for $\hat{\theta}$, we need in general to impose at least as
many moment conditions as there are unknown parameters ($m \geq p$). In the
exactly identified case ($m = p$), this system of $m$ equations in $p$ unknown
parameters has a unique solution (under suitable regularity conditions). The
numerical solution methods discussed in Section 4.2.3 can be used for this
purpose. In the over-identified case ($m > p$) there are more equations than
unknown parameters and there will in general exist no exact solution of this
system of equations. That is, although the $m$ (population) conditions (4.61)
are satisfied (by assumption) for the DGP, that is, for $\theta = \theta_0$, there often
exists no value $\hat{\theta}$ for which the sample condition (4.62) is exactly satisfied.
Let the $m \times 1$ vector $G_n(\theta)$ be defined by

$$G_n(\theta) = \sum_{i=1}^{n} g_i(\theta).$$

If there exists no value of $\theta$ so that $G_n(\theta) = 0$, one can instead minimize the
distance of this vector from zero, for instance by minimizing $\frac{1}{n}G_n'(\theta)G_n(\theta)$ with
respect to $\theta$. As an alternative one can also minimize a weighted sum of squares

$$\frac{1}{n}G_n'WG_n, \qquad (4.63)$$

where $W$ is an $m \times m$ symmetric and positive definite matrix. In the exactly
identified case (with a solution $G_n(\hat{\theta}) = 0$) the choice of $W$ is irrelevant, but
in the over-identified case it may be chosen to take possible differences in
sampling variation of the individual moment conditions into account. In
general the minimization of (4.63) will be a non-linear optimization problem
that can be solved by the numerical methods discussed in Section 4.2.3, for
example, by Newton-Raphson.
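
To make the over-identified case concrete, the sketch below uses an assumed toy problem that is not in the text: two measurements $y_i$ and $x_i$ of the same unknown mean $\mu$, so that $g_i(\mu) = (y_i - \mu, x_i - \mu)'$ with $m = 2$ and $p = 1$. The moment conditions are linear in $\mu$, so the minimizer of (4.63) is available in closed form and no Newton-Raphson iterations are needed. All names and the simulated data are illustrative.

```python
import random


def gmm_common_mean(y, x, W):
    """Minimize (1/n) G_n' W G_n for g_i(mu) = (y_i - mu, x_i - mu)'.

    G_n(mu) = n * (ybar - mu, xbar - mu)', so the criterion is quadratic in mu.
    For symmetric W the first order condition is iota' W (gbar - mu * iota) = 0
    with iota = (1, 1)' and gbar = (ybar, xbar)'.
    """
    n = len(y)
    ybar, xbar = sum(y) / n, sum(x) / n
    num = (W[0][0] + W[1][0]) * ybar + (W[0][1] + W[1][1]) * xbar
    den = W[0][0] + W[0][1] + W[1][0] + W[1][1]
    return num / den


def criterion(y, x, W, mu):
    """Evaluate the weighted sum of squares (1/n) G_n' W G_n at mu."""
    n = len(y)
    G = [sum(y) - n * mu, sum(x) - n * mu]
    WG = [W[0][0] * G[0] + W[0][1] * G[1], W[1][0] * G[0] + W[1][1] * G[1]]
    return (G[0] * WG[0] + G[1] * WG[1]) / n


random.seed(1)  # simulated data: x is a much noisier measurement than y
y = [3.0 + random.gauss(0.0, 1.0) for _ in range(500)]
x = [3.0 + random.gauss(0.0, 3.0) for _ in range(500)]

identity = [[1.0, 0.0], [0.0, 1.0]]
down_weight_x = [[1.0, 0.0], [0.0, 1.0 / 9.0]]  # weights inverse to the variances

mu_identity = gmm_common_mean(y, x, identity)
mu_weighted = gmm_common_mean(y, x, down_weight_x)
print(mu_identity, mu_weighted)
```

With $W = I$ the estimate is the simple average of the two sample means. The second weighting matrix shifts the estimate towards the sample mean of the less noisy measurement.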

Summary  of  computations  in  GMM  estimation 

Estimation  by  GMM  proceeds  in  the  following  two  steps. 

GMM  estimation 

• Step 1: Specify a sufficient number of moment conditions. Identify the $p$
parameters of interest $\theta$ and specify $m$ ($\geq p$) moment conditions (4.61).
The crucial assumption is that the DGP satisfies these moment conditions.
In particular, the specified moments should exist.

• Step 2: Estimate the parameters. Estimate $\theta$ by GMM by solving the
equations (4.62) (in the exactly identified case with $m = p$) or by minimizing
(4.63) (in the over-identified case with $m > p$). The choice of the
weighting matrix $W$ (when $m > p$) will be discussed in the next section.
4.4.3 GMM standard errors

An asymptotic result

To apply tests based on GMM estimators we need to know (asymptotic) expressions
for the covariance matrix of these estimators. In our analysis we will assume
that the moment conditions are valid for the DGP, that our GMM estimator is
consistent, and that the sample average $\frac{1}{n}G_n = \frac{1}{n}\sum_{i=1}^{n} g_i$ satisfies the following
central limit theorem:

$$\frac{1}{\sqrt{n}}G_n(\theta_0) \xrightarrow{d} N(0, J_0), \qquad J_0 = E[g_i(\theta_0)g_i'(\theta_0)]. \qquad (4.64)$$

These assumptions hold true under suitable regularity assumptions on the
moment conditions (4.61) and on the correlation structure of the data generating
process (in particular, the random vectors $g_i(\theta_0)$ should satisfy the moment conditions
(4.61) and these random vectors should not be too strongly correlated
for $i = 1, \ldots, n$ with $n \to \infty$). Note that $E[G_n(\theta_0)] = 0$ if (4.61) is valid. It
falls beyond the scope of this book to treat the required assumptions for asymptotic
normality in (4.64) in more detail. However, for two special cases (OLS and
ML), the result (4.64) follows from earlier results in this chapter, as we shall
now show.

Illustration  of  asymptotic  result:  OLS 

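If the moment conditions are those of OLS in (4.58), it follows that

$$G_n(\theta_0) = \sum_{i=1}^{n} x_i(y_i - x_i'\beta) = \sum_{i=1}^{n} x_i\varepsilon_i = X'\varepsilon.$$

Under appropriate conditions (Assumptions 1*, 2-6, and orthogonality between
$x_i$ and $\varepsilon_i$) there holds $J_0 = E[x_i\varepsilon_i^2x_i'] = \sigma^2E[x_ix_i'] = \sigma^2\,\mathrm{plim}\left(\frac{1}{n}\sum_{i=1}^{n} x_ix_i'\right) = \sigma^2Q$ and
then (4.64) follows from (4.6) in Section 4.1.4, which states that

$$\frac{1}{\sqrt{n}}X'\varepsilon \xrightarrow{d} N(0, \sigma^2Q).$$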

Second  illustration  of  asymptotic  result:  ML 

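If the moment conditions are those of ML in (4.60), it follows that

$$G_n(\theta_0) = \sum_{i=1}^{n} \left.\frac{\partial l_i}{\partial \theta}\right|_{\theta = \theta_0} = \left.\frac{\partial l}{\partial \theta}\right|_{\theta = \theta_0}.$$

Now (4.52) in Section 4.3.6 states that for $z(\theta_0) = \left.\frac{\partial l}{\partial \theta}\right|_{\theta = \theta_0}$ there holds
$\frac{1}{\sqrt{n}}z(\theta_0) \xrightarrow{d} N(0, \mathcal{I}_0)$. This result is equivalent to (4.64), because $z(\theta_0) = G_n(\theta_0)$ and

$$J_0 = E\left[\frac{\partial l_i}{\partial \theta}\frac{\partial l_i}{\partial \theta'}\right] = \lim\left(-\frac{1}{n}\sum_{i=1}^{n} E\left[\frac{\partial^2 l_i}{\partial \theta\,\partial \theta'}\right]\right) = \mathcal{I}_0$$

(see (4.36) and (4.57)).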


Derivation  of  asymptotic  distribution  of  the  GMM  estimator 

Assuming that (4.64) is satisfied, it follows that for large enough samples (so that $\hat{\theta}$
is close to $\theta_0$) the minimization problem in (4.63) can be simplified by the
linearization $G_n = G_n(\hat{\theta}) \approx G_{n0} + H_{n0}(\hat{\theta} - \theta_0)$, where $G_{n0} = G_n(\theta_0)$ and
$H_{n0} = H_n(\theta_0)$ is the $m \times p$ matrix defined by $H_n = \partial G_n / \partial \theta'$. Substituting this
linear approximation in (4.63) and using the fact that the derivative of
$G_{n0} + H_{n0}(\hat{\theta} - \theta_0)$ with respect to $\hat{\theta}$ is equal to $H_{n0}$, the first order conditions for
a minimum of (4.63) are given by

$$H_{n0}'W(G_{n0} + H_{n0}(\hat{\theta} - \theta_0)) = 0.$$

The solution is given by

$$\hat{\theta} = \theta_0 - (H_{n0}'WH_{n0})^{-1}H_{n0}'WG_{n0}.$$

Suppose that $\mathrm{plim}\left(\frac{1}{n}H_{n0}\right) = H_0$ exists, then it follows from the above expression
and (4.64) that

$$\sqrt{n}(\hat{\theta} - \theta_0) \xrightarrow{d} N(0, V), \qquad (4.65)$$

where $V = (H_0'WH_0)^{-1}H_0'WJ_0WH_0(H_0'WH_0)^{-1}$.


Choice  of  weighting  matrix  in  the  over-identified  case 

The weighting matrix $W$ in (4.63) can now be chosen so that this expression is
minimal (in the sense of positive semidefinite matrices) to get an asymptotically
efficient estimator. Intuitively, it seems reasonable to allow larger errors for
estimated parameters that contain more uncertainty. We can then penalize the
deviations of $G_n$ from zero less heavily in directions that have a larger variance.
This suggests choosing the weights inversely proportional to the covariance
matrix $\mathrm{var}\left(\frac{1}{\sqrt{n}}G_n(\theta_0)\right) \approx J_0$, that is, taking $W = J_0^{-1}$. It is left as an exercise
(see Exercise 4.7) to show that this is indeed the optimal weighting matrix. The
resulting $p \times p$ asymptotic covariance matrix is given by

$$V = (H_0'J_0^{-1}H_0)^{-1}. \qquad (4.66)$$

So the estimator $\hat{\theta}$ obtained by minimizing (4.63) with $W = J_0^{-1}$ is the most
efficient estimator within the class of GMM estimators obtained by minimizing
(4.63) for any positive definite matrix $W$.
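
The step from the general covariance matrix in (4.65) to the expression (4.66) can be written out explicitly. The following lines only substitute $W = J_0^{-1}$ into the formula for $V$; they do not prove optimality, which is the subject of Exercise 4.7.

```latex
\begin{aligned}
V &= (H_0' W H_0)^{-1} H_0' W J_0 W H_0 (H_0' W H_0)^{-1}
   \qquad \text{with } W = J_0^{-1} \\
  &= (H_0' J_0^{-1} H_0)^{-1}\, H_0' J_0^{-1} J_0 J_0^{-1} H_0\, (H_0' J_0^{-1} H_0)^{-1} \\
  &= (H_0' J_0^{-1} H_0)^{-1}\, (H_0' J_0^{-1} H_0)\, (H_0' J_0^{-1} H_0)^{-1} \\
  &= (H_0' J_0^{-1} H_0)^{-1}.
\end{aligned}
```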
Factors that influence the variance of GMM

The efficiency of this estimator further depends on the set of moment conditions
that has been specified. Stated in general terms, the best moment conditions (that
is, with the smallest covariance matrix $V$ for $\hat{\theta}$) are those for which $H_0$ is large and
$J_0$ is small (all in the sense of positive definite matrices). Here $H_0 = \partial G / \partial \theta'$ is large
when the violation of the moment conditions (4.61) is relatively strong for
$\theta \neq \theta_0$, that is, when the restrictions are powerful in this sense. And $J_0$ is small
when the random variation of the moments $g_i(\theta_0)$ in (4.61) is small.


Illustration:  OLS 

As an illustration, for the OLS moment conditions $E[x_i(y_i - x_i'\beta)] = 0$ in (4.58)
we obtain $H_n = \partial G_n / \partial \beta' = -\sum_{i=1}^{n} x_ix_i' = -X'X$ and (under Assumptions 1*, 2-6,
and orthogonality of the regressors $x_i$ with the disturbances $\varepsilon_i$)
$H_0 = \mathrm{plim}\left(-\frac{1}{n}X'X\right) = -Q$. We showed earlier that $J_0 = \sigma^2Q$ in this case, so
that

$$V_{OLS} = (H_0'J_0^{-1}H_0)^{-1} = (Q\sigma^{-2}Q^{-1}Q)^{-1} = \sigma^2Q^{-1} \approx \sigma^2\left(\frac{1}{n}X'X\right)^{-1}.$$

This agrees with (4.7) in Section 4.1.4. So this estimator is more efficient if $X'X$ is
larger (more systematic variation) and if $\sigma^2$ is smaller (less random variation).


Second  illustration:  ML 

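For the ML moment conditions (4.60) we obtain $H_n = \partial G_n / \partial \theta' = \sum_{i=1}^{n} \frac{\partial^2 l_i}{\partial \theta\,\partial \theta'}$
and it follows from (4.57) that $H_0 = \mathrm{plim}\left(\frac{1}{n}H_{n0}\right) = \mathrm{plim}\left(\frac{1}{n}\left.\frac{\partial^2 l}{\partial \theta\,\partial \theta'}\right|_{\theta = \theta_0}\right) = -\mathcal{I}_0$. We
showed earlier that in this case $J_0 = \mathcal{I}_0$, so that for ML there holds $H_0 = -J_0$ and

$$V_{ML} = (H_0'J_0^{-1}H_0)^{-1} = (\mathcal{I}_0\mathcal{I}_0^{-1}\mathcal{I}_0)^{-1} = \mathcal{I}_0^{-1}.$$

This is in line with (4.35) in Section 4.3.3. So ML estimators are efficient if
the information matrix $\mathcal{I}_0$ is large, or, equivalently, if the log-likelihood has a
large curvature around $\theta_0$. This is also intuitively evident, as for $\theta \neq \theta_0$ the
log-likelihood values drop quickly if the curvature is large.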


Iterative  choice  of  weights 

In practice $\theta_0$ is unknown, so that we cannot estimate $\theta$ with the criterion (4.63)
with $W = J_0^{-1}$. A possible iterative method is to start, for instance, with $W = I$
(the $m \times m$ identity matrix) and to minimize (4.63). The resulting estimate $\hat{\theta}$ is
then used to compute $\hat{J}_0 = \frac{1}{n}\sum_{i=1}^{n} g_i(\hat{\theta})g_i'(\hat{\theta})$ as an estimate of $J_0$. Then (4.63)
is minimized with $W = \hat{J}_0^{-1}$, and this process is repeated until the estimates
converge.
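
The iteration can be sketched in a few lines for an assumed toy problem that is not in the text: two measurements $y_i$ and $x_i$ of one common mean $\mu$, with $g_i(\mu) = (y_i - \mu, x_i - \mu)'$. Because these moments are linear in $\mu$, each minimization of (4.63) has a closed form. The helper names, the tolerance, and the simulated data are all illustrative choices.

```python
import random


def inv2(M):
    """Inverse of a 2 x 2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def estimate(ybar, xbar, W):
    """Closed-form minimizer of (1/n) G_n' W G_n for g_i = (y_i - mu, x_i - mu)'."""
    num = (W[0][0] + W[1][0]) * ybar + (W[0][1] + W[1][1]) * xbar
    den = W[0][0] + W[0][1] + W[1][0] + W[1][1]
    return num / den


def j_hat(y, x, mu):
    """(1/n) * sum_i g_i g_i' evaluated at mu, used as an estimate of J_0."""
    n = len(y)
    j11 = sum((yi - mu) ** 2 for yi in y) / n
    j22 = sum((xi - mu) ** 2 for xi in x) / n
    j12 = sum((yi - mu) * (xi - mu) for yi, xi in zip(y, x)) / n
    return [[j11, j12], [j12, j22]]


def iterated_gmm(y, x, tol=1e-10, max_iter=100):
    n = len(y)
    ybar, xbar = sum(y) / n, sum(x) / n
    mu = estimate(ybar, xbar, [[1.0, 0.0], [0.0, 1.0]])  # start with W = I
    for _ in range(max_iter):
        W = inv2(j_hat(y, x, mu))          # W = inverse of the estimated J_0
        mu_new = estimate(ybar, xbar, W)
        if abs(mu_new - mu) < tol:         # stop when the estimates converge
            return mu_new
        mu = mu_new
    return mu


random.seed(2)  # simulated data: x is a much noisier measurement than y
y = [3.0 + random.gauss(0.0, 1.0) for _ in range(500)]
x = [3.0 + random.gauss(0.0, 3.0) for _ in range(500)]
mu_hat = iterated_gmm(y, x)
print(mu_hat)
```

In this example the iterated estimate gives most of the weight to the less noisy measurement, in line with the intuition that moments with larger variance should be penalized less heavily.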
GMM standard errors

Consistent estimates of the standard errors of the GMM estimators $\hat{\theta}$ are
obtained as the square roots of the diagonal elements of the estimated
covariance matrix of $\hat{\theta}$, that is,

$$\widehat{\mathrm{var}}(\hat{\theta}) = (H_n'J_n^{-1}H_n)^{-1}, \qquad (4.67)$$

$$J_n = \sum_{i=1}^{n} g_i(\hat{\theta})g_i'(\hat{\theta}), \qquad H_n = \sum_{i=1}^{n} \frac{\partial g_i(\hat{\theta})}{\partial \theta'}. \qquad (4.68)$$

Here we used the fact that, according to (4.65), $\hat{\theta}$ has covariance
matrix approximately equal to $\frac{1}{n}V$, and $H_0$ and $J_0$ in (4.66) are approximated
by $\frac{1}{n}H_n$ and $\frac{1}{n}J_n$, evaluated at $\theta = \hat{\theta}$, so that

$$\mathrm{var}(\hat{\theta}) \approx \frac{1}{n}V = \frac{1}{n}(H_0'J_0^{-1}H_0)^{-1} \approx (H_n'J_n^{-1}H_n)^{-1}.$$

The covariance matrix in (4.67) is called the sandwich estimator of the
covariance matrix of the GMM estimator $\hat{\theta}$.


Test  of  moment  conditions:  The  J-test 

In the over-identified case, one can test the over-identifying restrictions by
means of the result that, under the null hypothesis that the moment conditions
(4.61) hold true,

$$G_n'J_n^{-1}G_n \approx \chi^2(m - p). \qquad (4.69)$$

This is called the J-test. Here $m$ is the number of moment conditions and $p$ is
the number of parameters in $\theta$. The result of the $\chi^2$-distribution is based on
(4.64), where $J_0$ is approximated by $\frac{1}{n}J_n$. Note that (4.63) can be seen as a
non-linear least squares problem with $m$ 'observations' and $p$ parameters,
which explains that the number of degrees of freedom is $m - p$. In the exactly
identified case ($m = p$) the moment conditions cannot be tested, as $G_n(\hat{\theta})$ will
be identically zero irrespective of the question whether the imposed moment
conditions are correct or not.
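
As a hedged numerical sketch of (4.67) to (4.69), consider an assumed toy problem that is not in the text: two measurements $y_i$ and $x_i$ of a common mean $\mu$, so that $g_i(\mu) = (y_i - \mu, x_i - \mu)'$, $m = 2$, $p = 1$, and the J-test has one degree of freedom. The function names, the fixed number of weighting rounds, and the simulated data are illustrative. The $\chi^2(1)$ P-value uses the identity $P[\chi^2(1) > j] = \mathrm{erfc}(\sqrt{j/2})$.

```python
import math
import random


def jn_inverse(y, x, mu):
    """Inverse of J_n = sum_i g_i g_i' for g_i = (y_i - mu, x_i - mu)'."""
    j11 = sum((yi - mu) ** 2 for yi in y)
    j22 = sum((xi - mu) ** 2 for xi in x)
    j12 = sum((yi - mu) * (xi - mu) for yi, xi in zip(y, x))
    det = j11 * j22 - j12 * j12
    return [[j22 / det, -j12 / det], [-j12 / det, j11 / det]]


def gmm_with_tests(y, x, n_iter=20):
    n = len(y)
    ybar, xbar = sum(y) / n, sum(x) / n
    mu = 0.5 * (ybar + xbar)                 # first round: W = I
    for _ in range(n_iter):                  # later rounds: W = J_n^{-1}
        inv = jn_inverse(y, x, mu)
        wy, wx = inv[0][0] + inv[1][0], inv[0][1] + inv[1][1]
        mu = (wy * ybar + wx * xbar) / (wy + wx)
    inv = jn_inverse(y, x, mu)               # J_n^{-1} at the final estimate
    # H_n = sum_i dg_i/dmu = (-n, -n)', so H_n' J_n^{-1} H_n = n^2 * (sum of entries).
    hjh = n * n * (inv[0][0] + inv[0][1] + inv[1][0] + inv[1][1])
    std_error = math.sqrt(1.0 / hjh)         # square root of (4.67)
    G = [n * (ybar - mu), n * (xbar - mu)]   # G_n at the estimate
    j_stat = (G[0] * (inv[0][0] * G[0] + inv[0][1] * G[1])
              + G[1] * (inv[1][0] * G[0] + inv[1][1] * G[1]))  # statistic in (4.69)
    j_stat = max(j_stat, 0.0)                # guard against rounding below zero
    p_value = math.erfc(math.sqrt(j_stat / 2.0))  # chi-squared with m - p = 1
    return mu, std_error, j_stat, p_value


random.seed(3)  # simulated data in which both moment conditions hold
y = [3.0 + random.gauss(0.0, 1.0) for _ in range(500)]
x = [3.0 + random.gauss(0.0, 3.0) for _ in range(500)]
mu_hat, se, j_stat, p_value = gmm_with_tests(y, x)
print(mu_hat, se, j_stat, p_value)
```

A small P-value would cast doubt on the assumption that the two measurements share the same mean. The standard error is never larger than the one obtained from the first moment condition alone, which illustrates the gain from adding a valid moment condition.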


Summary  of  computations  in  GMM  estimation  and  testing 

Summarizing  the  results  on  GMM  estimation  and  testing  obtained  in  this  and 
the  foregoing  section,  this  approach  consists  of  the  following  steps. 

GMM estimation and testing

• Step 1: Specify a sufficient number of moment conditions. Identify the $p$
parameters of interest $\theta$ and specify $m$ ($\geq p$) moment conditions (4.61). The
crucial assumption is that the DGP satisfies these moment conditions.

• Step 2: Estimate the parameters. Estimate $\theta$ by GMM by solving the
equations (4.62) (if $m = p$) or by minimizing (4.63) (if $m > p$). The
weighting matrix $W$ can be chosen iteratively, starting with $W = I$ and (if
$\hat{\theta}_h$ is the estimate obtained in the $h$th iteration) choosing in the $(h + 1)$st
iteration $W = J_h^{-1}$, where $J_h = \frac{1}{n}\sum_{i=1}^{n} g_i(\hat{\theta}_h)g_i'(\hat{\theta}_h)$.

• Step 3: Compute the GMM standard errors. The asymptotic covariance
matrix of the GMM estimator $\hat{\theta}$ can be obtained from (4.67) and (4.68).
The GMM standard errors are the square roots of the diagonal elements of
this matrix.

• Step 4: Test of moment conditions (in over-identified models). The correctness
of the moment conditions can be tested in the over-identified case
($m > p$) by the J-test in (4.69).


Exercises:  T:  4.7;  S:  4.11c,  f,  4.12c,  g. 


4.4.4  Quasi-maximum  likelihood 

Moment  conditions  derived  from  a  postulated  likelihood 

Considering the four steps of GMM at the end of the last section, the
question remains how to find the required moment conditions in step 1. In
some cases these conditions can be based on models of economic behaviour,
for instance, expected utility maximization. Another possibility is the
so-called quasi-maximum likelihood (QML) method. This method derives
the moment conditions from a postulated likelihood function, as in (4.60). It
is assumed that the corresponding moment conditions

$$E[g_i(\theta_0)] = E\left[\left.\frac{\partial l_i}{\partial \theta}\right|_{\theta = \theta_0}\right] = 0$$

hold true for the DGP, but that the likelihood function is possibly misspecified.
This means that the expression (4.35) for the covariance matrix does not
apply. The (asymptotically) correct covariance matrix can be computed by
means of (4.67). As was discussed in Section 4.4.3, if the likelihood function is
correct, then $H_0 = -J_0$, but this no longer holds true if the model is misspecified.
The reason is that the equality (4.36) holds true only at $\theta = \theta_0$, that is,
for correctly specified models. On the other hand, the results in (4.65) and
(4.66) always hold true as long as the moment conditions (4.61) are valid.

Comparison  of  ML  and  QML 

So in QML the likelihood function is used only to obtain the first order
conditions (4.60) and the standard errors are computed from (4.67). QML is
consistent if the conditions $E[\partial l_i / \partial \theta] = 0$ hold true for the DGP. In practice,
when one is uncertain about the correct specification of the likelihood
function, it may be helpful to calculate the standard errors in both ways,
with ML and with QML. If the outcomes are widely different, this is a sign of
misspecification.

Summary  of  QML  method 

In quasi-maximum likelihood, the parameter estimates and their standard
errors are computed in the following way. Here it is assumed that the $n$
observations $(y_i, x_i)$ are mutually independent for $i = 1, \ldots, n$.


Quasi-maximum likelihood

• Step 1: Specify a probability distribution for the observed data. Identify the
$p$ parameters of interest $\theta$. Postulate a probability distribution $p(y_i, x_i, \theta)$ for
the $i$th observation, and let $l_i(\theta) = \log(p(y_i, x_i, \theta))$ be the contribution of the
$i$th observation to the log-likelihood $\log(L(\theta)) = \sum_{i=1}^{n} \log(p(y_i, x_i, \theta))$.

• Step 2: Derive the corresponding moment conditions. Define the $p$ moment
conditions $E[g_i(\theta)] = 0$, where the moments are defined by
$g_i(\theta) = \partial l_i / \partial \theta$, $i = 1, \ldots, n$. The crucial assumption is that the DGP satisfies
these moment conditions.

• Step 3: Estimate the parameters. Estimate $\theta$ by solving the equations (4.62)
(as $m = p$, there is no need for a weighting matrix). This is equivalent to
ML estimation based on the chosen probability distribution in step 1.

• Step 4: Compute the GMM standard errors. Approximate standard errors of
the QML estimates can be obtained from the asymptotic covariance matrix
in (4.67) and (4.68), with $g_i(\theta) = \partial l_i / \partial \theta$ and $l_i(\theta) = \log(p(y_i, x_i, \theta))$.
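
The four steps can be traced in a deliberately simple, assumed setting that is not in the text: the postulated distribution is $N(\mu, 1)$ for every observation, whereas the simulated data have standard deviation 2. The moment condition $E[y_i - \mu] = 0$ remains valid, so the QML estimate of $\mu$ is consistent, but the conventional ML standard error is wrong. Function and variable names are illustrative.

```python
import math
import random


def qml_normal_mean(y):
    """QML for the mean based on a postulated N(mu, 1) likelihood.

    Step 1: l_i(mu) = -0.5 * log(2 * pi) - 0.5 * (y_i - mu)^2.
    Step 2: g_i(mu) = dl_i/dmu = y_i - mu.
    """
    n = len(y)
    mu_hat = sum(y) / n                          # Step 3: solves sum_i g_i(mu) = 0
    h_n = -float(n)                              # H_n = sum_i dg_i/dmu
    j_n = sum((yi - mu_hat) ** 2 for yi in y)    # J_n = sum_i g_i(mu_hat)^2
    se_ml = math.sqrt(1.0 / n)                   # inverse information, valid only if N(mu, 1) is correct
    se_qml = math.sqrt(j_n / (h_n * h_n))        # Step 4: (H_n' J_n^{-1} H_n)^{-1} as in (4.67)
    return mu_hat, se_ml, se_qml


random.seed(4)  # simulated data with standard deviation 2, not 1
y = [1.0 + random.gauss(0.0, 2.0) for _ in range(400)]
mu_hat, se_ml, se_qml = qml_normal_mean(y)
print(mu_hat, se_ml, se_qml)
```

Here the QML standard error is roughly twice the ML standard error, and such a wide difference signals that the postulated likelihood is misspecified.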


Exercises:  E:  4.17h. 


4.4.5  GMM  in  simple  regression 

The  two  moment  conditions 

We illustrate GMM by considering the simple regression model. The results
will be used in the example in the next section. Suppose that we wish to
estimate the parameters $\alpha$ and $\beta$ in the model

$$y_i = \alpha + \beta x_i + \varepsilon_i, \quad i = 1, \ldots, n.$$

We suppose that the functional form is correctly specified in the sense that the
DGP has parameters $(\alpha_0, \beta_0)$ with the property that

$$E[\varepsilon_i] = E[y_i - \alpha_0 - \beta_0x_i] = 0, \quad i = 1, \ldots, n.$$

Further we assume that the explanatory variable $x_i$ satisfies the orthogonality
condition

$$E[x_i\varepsilon_i] = E[x_i(y_i - \alpha_0 - \beta_0x_i)] = 0, \quad i = 1, \ldots, n.$$

This provides two moment conditions, so that the model is exactly
identified.


The  GMM  estimators 

The GMM estimates of $\alpha$ and $\beta$ are obtained by replacing the expectation $E$
by the sample mean $\frac{1}{n}\sum_{i=1}^{n}$, so that

$$\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{\alpha} - \hat{\beta}x_i) = 0, \qquad \frac{1}{n}\sum_{i=1}^{n} x_i(y_i - \hat{\alpha} - \hat{\beta}x_i) = 0.$$

These equations are equivalent to the two normal equations (2.9) and (2.10)
in Section 2.1.2 (p. 82). So the GMM estimates of $\alpha$ and $\beta$ are the OLS
estimates $a$ and $b$.


GMM  standard  errors  (allowing  for  heteroskedasticity) 

The variance of the estimators $a$ and $b$ was derived in Section 2.2.4 (p. 96) under
Assumptions 1-6 (see (2.27) and (2.28)). The above two moment conditions
correspond to Assumptions 1 (exogeneity), 2 (zero mean), and 5 and 6 (linear
model with constant parameters). We now suppose that Assumption 4 (no correlation)
is also satisfied, but that Assumption 3 (homoskedasticity) is doubtful. If
Assumption 3 is violated, then the formulas (2.27) and (2.28) for the variances of $a$
and $b$ do not apply. A consistent estimator of the $2 \times 2$ covariance matrix is
obtained from (4.67). In our case,

$$g_i(a, b) = \begin{pmatrix} y_i - a - bx_i \\ x_i(y_i - a - bx_i) \end{pmatrix} = \begin{pmatrix} e_i \\ x_ie_i \end{pmatrix},$$

where $e_i = y_i - a - bx_i$ is the $i$th residual, so that the estimated covariance
matrix (4.67) is $\widehat{\mathrm{var}}(\hat{\theta}) = (H_n'J_n^{-1}H_n)^{-1}$ with

$$H_n = -\sum_{i=1}^{n} \begin{pmatrix} 1 & x_i \\ x_i & x_i^2 \end{pmatrix}, \qquad J_n = \sum_{i=1}^{n} e_i^2 \begin{pmatrix} 1 & x_i \\ x_i & x_i^2 \end{pmatrix}.$$

If the residuals are all of nearly equal magnitude so that $e_i^2 \approx \sigma^2$, then we obtain
$H_n = -X'X$ and $J_n \approx \sigma^2(X'X)$, where $X$ is the $n \times 2$ regressor matrix. The formula
(4.67) then gives $\widehat{\mathrm{var}}(\hat{\theta}) \approx \sigma^2(X'X)^{-1}$, as in Chapter 3. However, if the residuals differ
much in magnitude, then $J_n$ may differ considerably from $\sigma^2(X'X)$, and the
(correct) GMM expression in (4.67) may differ much from the (incorrect) expression
$\sigma^2(X'X)^{-1}$ for the covariance matrix.
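
The comparison between the GMM covariance matrix (4.67) and the conventional expression can be sketched numerically. The code below is an illustration on simulated heteroskedastic data, with all names and the data generating choices being assumptions. It uses the fact that $H_n$ is square and symmetric here, so that $(H_n'J_n^{-1}H_n)^{-1} = H_n^{-1}J_nH_n^{-1}$.

```python
import math
import random


def inv2(M):
    """Inverse of a 2 x 2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matmul2(A, B):
    """Product of two 2 x 2 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def gmm_and_conventional_cov(x, y):
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    det = n * sxx - sx * sx
    a = (sxx * sy - sx * sxy) / det              # OLS = GMM estimates
    b = (n * sxy - sx * sy) / det
    e = [yi - a - b * xi for xi, yi in zip(x, y)]
    H = [[-n, -sx], [-sx, -sxx]]                 # H_n = -X'X
    see = sum(ei * ei for ei in e)
    sxee = sum(xi * ei * ei for xi, ei in zip(x, e))
    sxxee = sum(xi * xi * ei * ei for xi, ei in zip(x, e))
    J = [[see, sxee], [sxee, sxxee]]             # J_n = sum_i e_i^2 (1, x_i)'(1, x_i)
    H_inv = inv2(H)
    V_gmm = matmul2(matmul2(H_inv, J), H_inv)    # (H_n' J_n^{-1} H_n)^{-1}
    s2 = see / (n - 2)
    V_conv = [[s2 * v for v in row] for row in inv2([[n, sx], [sx, sxx]])]
    return (a, b), V_gmm, V_conv


random.seed(5)  # simulated data: disturbance spread grows with |x|
x = [random.gauss(0.0, 1.0) for _ in range(300)]
y = [1.0 + 0.5 * xi + random.gauss(0.0, 0.5 + abs(xi)) for xi in x]
coef, V_gmm, V_conv = gmm_and_conventional_cov(x, y)
print(coef)
print(math.sqrt(V_gmm[1][1]), math.sqrt(V_conv[1][1]))  # two standard errors of b
```

With heteroskedasticity of this kind the two standard errors of the slope can differ noticeably, whereas for residuals of nearly equal magnitude they would be close.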

Exercises:  E:  4.17g. 
4.4.6 Illustration: Stock Market Returns

We consider once again the excess returns data for the sector of cyclical
consumer goods ($y_i$) and for the whole asset market ($x_i$) in the UK (see also
Examples 4.5 (p. 243-6) and 4.7 (p. 251)). We will discuss (i) the data and
the model assumptions, (ii) two estimation methods, OLS and QML with
(scaled) t(5)-disturbances, (iii) correctness of the implied moment conditions,
(iv) the estimation results, and (v) tests of two hypotheses.

(i)  Data  and  model  assumptions 

The data set consists of $n = 240$ monthly data over the period 1980.01-
1999.12. The CAPM is given by

$$y_i = \alpha + \beta x_i + \varepsilon_i, \quad i = 1, \ldots, n.$$

We make the following assumptions on the DGP. The disturbances have
mean zero (Assumption 2). The terms $x_i$ and $\varepsilon_i$ are independent, so that in
particular $E[x_i\varepsilon_i] = E[x_i]E[\varepsilon_i] = 0$ (compare with Assumption 1). The disturbances
are independent (Assumption 4) and the DGP is described by the
above simple regression model for certain (unknown) parameters $(\alpha_0, \beta_0)$
(Assumptions 5 and 6). However, we do not assume normality (Assumption
7), as the results in Example 4.4 (p. 223-4) indicate that the distribution may
have fat tails. We also do not assume homoskedasticity (Assumption 3), as
the variance of the disturbances may be varying over time (see Example 4.7).
That is, we assume that the disturbances $\varepsilon_i$ are independently distributed
with unknown distributions $p_i(\varepsilon_i)$ with mean $E[\varepsilon_i] = 0$ and possibly different
unknown variances $E[\varepsilon_i^2] = \sigma_i^2$, $i = 1, \ldots, n$. Further we assume that the
density functions $p_i(\varepsilon_i)$ are symmetric around zero in the sense that
$p_i(\varepsilon_i) = p_i(-\varepsilon_i)$, that is, $P[\varepsilon_i > c] = P[\varepsilon_i < -c]$ for every value of $c$.

(ii)  Two  estimation  methods:  OLS  and  QML  with  (scaled)  t(5)-disturbances 

As the distribution of the disturbances is unknown, we cannot estimate
the parameters $\alpha$ and $\beta$ by maximum likelihood. We consider two
estimators, least squares (OLS) and quasi-maximum likelihood (QML) based
on the (scaled) t(5)-distribution introduced in Example 4.5 (p. 244). We
compute the standard errors by GMM and compare the outcomes with
those obtained by the conventional expressions for OLS and ML standard
errors.

(iii)  Correctness  of  moment  conditions  under  the  stated  assumptions 

Under the above assumptions, the OLS and QML estimators are consistent
and (asymptotic) GMM standard errors can be obtained from (4.67) and
(4.68), provided that the specified moment conditions hold true for the DGP.
For OLS this follows from Assumptions 1 and 2, as discussed in Section
4.4.5. For QML, the moment conditions are given by (4.59), that is,
$E[\partial l_i / \partial \theta]_{\theta = \theta_0} = 0$. We use only the moments for $\alpha$ and $\beta$, with first order
conditions described in Example 4.5 (p. 244). That is,

$$g_i^{QML}(\alpha_0, \beta_0) = \begin{pmatrix} \dfrac{6\varepsilon_i}{5\sigma^2 + \varepsilon_i^2} \\[2ex] \dfrac{6x_i\varepsilon_i}{5\sigma^2 + \varepsilon_i^2} \end{pmatrix}$$

(in QML we use the estimated value $\hat{\sigma} = 4.49$ obtained in Example 4.5). It
follows from Assumptions 1 and 2, together with the symmetry of the densities
$p_i(\varepsilon_i)$, that $E[g_i^{QML}(\alpha_0, \beta_0)] = 0$, because both components are odd functions
of $\varepsilon_i$. Therefore, under the stated assumptions
the moment conditions are valid for both estimation procedures.

(iv)  Estimation  results 

The results in Exhibit 4.21 show the estimates for OLS (Panels 1 and 2)
and QML (Panels 3 and 4), with standard errors computed both in the
conventional way (see Section 4.4.3, by means of $V_{OLS}$ in Panel 1 and $V_{ML}$
in Panel 3) and by means of GMM as in (4.67) (in Panels 2 and 4). For OLS,
the matrices $H_n$ and $J_n$ in (4.68) were derived in Section 4.4.5. For QML, the
matrices $H_n$ and $J_n$ can be derived from the above expression for $g_i^{QML}$.
The differences between the OLS and QML estimates are not so large, and
the same applies for the standard errors (computed in four different ways
for $\alpha$ and $\beta$). Therefore, the effects of possible heteroskedasticity and
non-normality of the disturbances seem to be relatively mild for these data. The
application of OLS with conventional formulas for the standard errors seems
to be reasonable for these data.

(v)  Test  outcomes 

We finally consider tests for the hypothesis that $\alpha = 0$ against the alternative
that $\alpha \neq 0$, and also for $\beta = 1$ against the alternative that $\beta \neq 1$. Based on the
(asymptotic) normal distribution, the P-values of the test outcomes in
Exhibit 4.21 are as follows:

for $\alpha = 0$: $P_{OLS} = 0.22$, $P_{OLS}^{GMM} = 0.19$, $P_{ML} = 0.32$, $P_{ML}^{GMM} = 0.30$,
for $\beta = 1$: $P_{OLS} = 0.023$, $P_{OLS}^{GMM} = 0.012$, $P_{ML} = 0.008$, $P_{ML}^{GMM} = 0.003$.


Panel 1: Dependent Variable: RENDCYCO
Method: Least Squares
Sample: 1980:01 1999:12
Included observations: 240

Variable      Coefficient    Std. Error    t-Statistic    Prob.
C             -0.447481      0.362943      -1.232924      0.2188
RENDMARK       1.171128      0.075386      15.53500       0.0000
R-squared      0.503480

Panel 2: Dependent Variable: RENDCYCO
Method: Generalized Method of Moments
Sample: 1980:01 1999:12
Included observations: 240
Moment Conditions: normal equations

Variable      Coefficient    Std. Error    t-Statistic    Prob.
C             -0.447481      0.342143      -1.307876      0.1922
RENDMARK       1.171128      0.067926      17.24135       0.0000
R-squared      0.503480

Panel 3: Model: RENDCYCO = C(1) + C(2)*RENDMARK + EPS
EPS are IID with scaled t(5) distribution, scale parameter is C(3)
Method: Maximum Likelihood (BHHH)
Sample: 1980:01 1999:12
Included observations: 240
Convergence achieved after 19 iterations

Parameter        Coefficient    Std. Error    z-Statistic    Prob.
C(1)             -0.344971      0.348223      -0.990660      0.3219
C(2)              1.196406      0.073841      16.20244       0.0000
C(3)              4.494241      0.271712      16.54049       0.0000
Log likelihood   -747.6813

Panel 4: GMM standard errors

a_ML (C(1) in Panel 3)    0.334173
b_ML (C(2) in Panel 3)    0.066475

Exhibit 4.21 Stock Market Returns (Section 4.4.6)

Results of different estimates of CAPM for the sector of cyclical consumer goods, estimated by
OLS (Panel 1) and by ML (Panel 3, using the scaled t(5) distribution for the disturbances). For
OLS the standard errors are computed in two ways, as usual (Panel 1, using the expression
$s^2(X'X)^{-1}$ of Chapter 3) and by means of GMM (Panel 2, using the normal equations of OLS
as moment conditions). For ML the standard errors are also computed in two ways, as usual
(Panel 3, using the information matrix as discussed in Section 4.3.3) and by means of GMM
(Panel 4, using the first order conditions for the maximum of the log-likelihood as moment
conditions).
The four computed P-values for these two tests all point in the same direction.
The outcomes suggest that we should reject the hypothesis that $\beta = 1$
but not that $\alpha = 0$. The conclusions based on ML are somewhat sharper than
those based on OLS.
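
The reported P-values can be reproduced from the coefficients and standard errors in Exhibit 4.21. The sketch below is an illustration rather than the original computation: it forms the statistic (estimate minus hypothesized value) divided by the standard error and evaluates the two-sided P-value under the standard normal distribution, using the identity $P[|Z| > z] = \mathrm{erfc}(z/\sqrt{2})$. The dictionary layout and function name are illustrative.

```python
import math


def two_sided_p(estimate, std_error, null_value):
    """Two-sided P-value based on the (asymptotic) normal distribution."""
    z = (estimate - null_value) / std_error
    return math.erfc(abs(z) / math.sqrt(2.0))


# (estimate of alpha, its standard error, estimate of beta, its standard error),
# all copied from Panels 1 to 4 of Exhibit 4.21.
results = {
    "OLS":     (-0.447481, 0.362943, 1.171128, 0.075386),
    "OLS-GMM": (-0.447481, 0.342143, 1.171128, 0.067926),
    "ML":      (-0.344971, 0.348223, 1.196406, 0.073841),
    "ML-GMM":  (-0.344971, 0.334173, 1.196406, 0.066475),
}

p_alpha = {k: two_sided_p(v[0], v[1], 0.0) for k, v in results.items()}
p_beta = {k: two_sided_p(v[2], v[3], 1.0) for k, v in results.items()}

for name in results:
    print(name, round(p_alpha[name], 2), round(p_beta[name], 3))
```

The Prob. column in Panel 1 of the exhibit (0.2188 for the constant) is slightly larger than the normal-based value, which is consistent with that column being based on the t-distribution rather than on the normal approximation used here.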
Summary, further reading, and keywords


SUMMARY 

In  this  chapter  we  considered  methods  that  can  be  applied  if  some  of  the 
assumptions  of  the  regression  model  in  Chapter  3  are  not  satisfied.  If  the 
regressors  are  stochastic  or  the  disturbances  are  not  normally  distributed, 
then  the  results  of  Chapter  3  are  still  valid  asymptotically  if  the  regressors  are 
exogenous. If the model is non-linear in the parameters, then the least
squares estimator has to be computed by numerical optimization methods
and this estimator has asymptotic properties similar to those of the least squares
estimator in the linear model. Maximum likelihood is a widely applicable
estimation method that has (asymptotically) optimal properties, that is, it
is consistent and it has minimal variance among all consistent estimators.
This method requires that the joint probability distribution of the disturbances
is correctly specified. If there is much uncertainty about this distribution,
then the generalized method of moments can be applied. In this case the
parameters  are  estimated  by  solving  a  set  of  moment  equations,  and  the 
standard  errors  are  computed  in  a  way  that  does  not  require  the  joint 
probability  distribution.  This  method  requires  that  the  specified  moment 
conditions  are  valid  for  the  data  generating  process. 


FURTHER  READING 

The textbooks mentioned in Chapter 3, Further Reading (pp. 178-9), all contain
sections on asymptotic analysis, non-linear methods, maximum likelihood, and
the generalized method of moments. We further refer in particular to Davidson
and MacKinnon (1993), Gourieroux and Monfort (1995), and Hayashi (2000).

Davidson, R., and MacKinnon, J. G. (1993). Estimation and Inference in
Econometrics. New York: Oxford University Press.

Gourieroux, C., and Monfort, A. (1995). Statistics and Econometric Models.
2 vols. Cambridge: Cambridge University Press.

Hayashi, F. (2000). Econometrics. Princeton: Princeton University Press.

Exercises


THEORY  QUESTIONS 

4.1 (Section 4.1.3)

The consistency of $b$ depends on the probability limits of the two terms
$\frac{1}{n}X'X$ and $\frac{1}{n}X'\varepsilon$.

a. Investigate this consistency for nine cases, according to whether these
limits are zero, finite, or infinite.

b. Give examples of models where $\mathrm{plim}(\frac{1}{n}X'X)$ is zero,
finite, and infinite. Give an intuitive explanation why $b$ is (in)consistent
in these cases.

4.2 (Section 4.1.5)

Consider the data generating process $y_i = x_i + \varepsilon_i$, where the
$\varepsilon_i$ are independently normally distributed N(0, 1) random
variables. For simplicity we estimate the parameter $\beta = 1$ by regression
in the model without constant term, that is, in the model
$y_i = \beta x_i + \varepsilon_i$. By the speed of convergence of $b$ to
$\beta$ we mean the power $n^p$ for which the distribution of $n^p(b - \beta)$
does not diverge and also does not have limit zero if $n \to \infty$.
Section 4.1 presented results with speed of convergence $\sqrt{n}$, see (4.7).

a. Let $x_i = i$. Show that this DGP does not satisfy Assumption 1*. Show that
the speed of convergence is $n\sqrt{n}$ in this case. (It may be helpful to use
the fact that $\sum_{i=1}^{n} i^2 = \frac{1}{6}n(n+1)(2n+1)$.)

b. Now let $x_i = 1/i$. Show that this DGP also does not satisfy
Assumption 1*. Show that $\mathrm{plim}(b)$ does not exist in this case, and
that the speed of convergence is $n^0$. (It may be helpful to use the fact
that $\sum_{i=1}^{\infty} (1/i)^2 = \frac{1}{6}\pi^2$.)
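
As a pointer for Exercise 4.2 (a sketch of the key step, not a full solution): in the model without constant term and with unit disturbance variance, $b - \beta$ is a weighted sum of the disturbances, so for fixed $x_i$ its variance is $1/\sum x_i^2$. Combining this with the two sums given as hints indicates the scaling in each case.

```latex
\begin{aligned}
b - \beta &= \frac{\sum_{i=1}^{n} x_i \varepsilon_i}{\sum_{i=1}^{n} x_i^2}
  \sim N\!\left(0,\ \frac{1}{\sum_{i=1}^{n} x_i^2}\right), \\
x_i = i:&\quad \mathrm{var}(b) = \frac{6}{n(n+1)(2n+1)} \approx \frac{3}{n^3}
  \ \Rightarrow\ n^{3/2}(b - \beta) \to N(0,\,3), \\
x_i = 1/i:&\quad \mathrm{var}(b) \to \frac{6}{\pi^2} > 0
  \ \Rightarrow\ b - \beta \ \text{does not converge to zero}.
\end{aligned}
```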

4.3 (Section 4.1.3)

Consider the linear model $y = X\beta + \varepsilon$ with stochastic regressors
that satisfy Assumption 1* and with
$\mathrm{plim}(\frac{1}{n}X'\varepsilon) = (0, \ldots, 0, \rho)'$, so that only
the last regressor is asymptotically correlated with the error term.

a. Show that, in general, $b$ is inconsistent with respect to all coefficients
of the vector $\beta$.

b. Under which condition does only the estimator of the last coefficient
become inconsistent? Provide an intuitive explanation of this result.

4.4 (Section 4.1.3)

Consider the model with measurement errors, where two economic variables $y^*$
and $x^*$ are related by $y^* = \alpha + \beta x^*$ and where the measured
variables are given by $y = y^* + \varepsilon_y$ and $x = x^* + \varepsilon_x$.
The variances of the measurement errors $\varepsilon_y$ and $\varepsilon_x$ are
denoted by $\sigma_y^2$ and $\sigma_x^2$ respectively. It is assumed that
$\varepsilon_y$ and $\varepsilon_x$ are uncorrelated with each other and that
both are also uncorrelated with the variables $y^*$ and $x^*$. The variance of
$x^*$ is denoted by $\sigma_*^2$. The observed data consist of $n$ independent
observations $(x_i, y_i)$, $i = 1, \ldots, n$. For simplicity, assume that
$x^*$, $\varepsilon_x$ and $\varepsilon_y$ are all IID (identically and
independently distributed).

a. Write the model in the form $y = \alpha + \beta x + \varepsilon$ and express
$\varepsilon$ in terms of $\varepsilon_y$ and $\varepsilon_x$.

b. Show that the OLS estimator $b$ is inconsistent if $\sigma_x^2 \neq 0$ and
$\beta \neq 0$.

c. Express the magnitude of the inconsistency (that is,
$\mathrm{plim}(b) - \beta$) in terms of the so-called signal-to-noise ratio
$\mathrm{var}(x^*)/\mathrm{var}(\varepsilon_x) = \sigma_*^2/\sigma_x^2$.
Explain this result by means of two scatter diagrams, one with small and the
other with large signal-to-noise ratio.

4.5 (Section 4.3.8)

In Section 4.3 it was discussed that the LM-, LR-, and W-tests are
asymptotically distributed as $\chi^2(g)$, but that $gF(g, n-k)$ can also be
used.

a. Show that for $n \to \infty$ there holds $gF(g, n-k) \to \chi^2(g)$.

b. Check that the P-values (corresponding to the right tail of the
distributions) of $gF(g, n-k)$ are larger than those of $\chi^2(g)$, by using a
statistical package or by inspecting tables of critical values of both
distributions.

c. Comment on the relevance of this result for applying the LM-, LR-, and
W-tests.
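
For part b of Exercise 4.5, one way to get a feel for the comparison without tables is the special case $g = 2$, where both right-tail probabilities have closed forms: $P(\chi^2(2) > x) = e^{-x/2}$ and $P(2F(2, m) > x) = (1 + x/m)^{-m/2}$, with $m$ standing for $n - k$. The sketch below covers only this special case; the function names are illustrative and do not come from the text.

```python
import math

def pvalue_chi2_2(x: float) -> float:
    """Right-tail probability P(chi2(2) > x) = exp(-x/2)."""
    return math.exp(-x / 2.0)

def pvalue_gF_2(x: float, m: int) -> float:
    """Right-tail probability P(2 * F(2, m) > x) = (1 + x/m)^(-m/2)."""
    return (1.0 + x / m) ** (-m / 2.0)

if __name__ == "__main__":
    x = 5.99  # close to the 5% critical value of chi2(2)
    print(f"chi2(2):  {pvalue_chi2_2(x):.4f}")
    for m in (10, 50, 200, 1000):
        # m plays the role of n - k, the denominator degrees of freedom
        print(f"m = {m:4d}: {pvalue_gF_2(x, m):.4f}")
```

In this special case the P-value based on $gF$ is always the larger of the two and shrinks towards the $\chi^2$ value as $m$ grows.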

4.6 (Sections 4.2.4, 4.3.2, 4.3.8)

a. Prove the expression (4.23) for the relation between the LM-test and the
F-test in the linear model $y = X_1\beta_1 + X_2\beta_2 + \varepsilon$ for the
null hypothesis $\beta_2 = 0$. It may be helpful to prove as a first step that
the numerator of $R^2$ in (4.21) can be written, with
$M = I - X(X'X)^{-1}X'$, as $e_R'(I - M)e_R = e_R'e_R - e_R'Me_R$ and that
$e_R'Me_R = e'e$.

b. Prove the inequalities in (4.56) for testing a linear hypothesis in the
linear model $y = X\beta + \varepsilon$. For this purpose, make use of the
expressions (4.23), (4.46), and (4.49), which express the three tests LM, LR,
and W in terms of the F-test.

c. Show the statement at the end of Section 4.3.2 that ML in a non-linear
regression model with normally distributed disturbances is equivalent to
non-linear least squares.

4.7* (Section 4.4.3)

Prove that the choice of weights $W = J_0^{-1}$ (with the notation of
Section 4.4) minimizes the asymptotic covariance matrix $V$ in (4.65) of the
GMM estimator. Also prove that this choice makes the GMM estimator invariant
with respect to linear transformations of the model restrictions (4.61), that
is, if $g_i$ is replaced by $Ag_i$, where $A$ is an $m \times m$ non-singular
matrix.


EMPIRICAL  AND  SIMULATION  QUESTIONS 

4.8 (Section 4.1.3)

Suppose that data are generated by the process
$y_i = \beta_1 + \beta_2 x_{2i} + \beta_3 x_{3i} + \omega_i$, where the
$\omega_i$ are IID(0, $\sigma^2$) disturbances that are uncorrelated with the
regressors $x_2$ and $x_3$. Suppose that the regressors $x_2$ and $x_3$ are
positively correlated. An investigator studies the relation between $y$ and
$x_2$ by regressing $y$ on a constant and $x_2$, that is, $x_3$ is omitted.
The estimator of $\beta_2$ in this restricted model is denoted by $b_R$, and
the estimator of $\sigma^2$ in this model is denoted by $s_R^2$.

a. Investigate whether $b_R$ is an unbiased and/or consistent estimator of
$\beta_2$.

b. Also argue whether or not $s_R^2$ will be an unbiased and/or consistent
estimator of $\sigma^2$.

c. Construct a data generating process that satisfies the above
specifications. Generate samples of sizes n = 10, n = 100, and n = 1000 of
this process.

d. Compute the estimates $b_R$ and $s_R^2$ for the sample sizes n = 10,
n = 100, and n = 1000. Compare these outcomes with the results in a and b.
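
Parts c and d of Exercise 4.8 can be started from a small simulation such as the one below. It is only a sketch under an assumed DGP (the coefficient values 1, 2, 3, the link $x_3 = 0.5x_2 + v$, and the function name are illustrative choices, not taken from the text), and it uses the standard library only.

```python
import random

def simulate_restricted_ols(n, seed=0):
    """Simulate the DGP and regress y on a constant and x2 only.

    Assumed, illustrative DGP:
        x2 ~ N(0, 1), x3 = 0.5 * x2 + v with v ~ N(0, 1),
        y = 1 + 2 * x2 + 3 * x3 + w with w ~ N(0, 1), so sigma^2 = 1.
    Returns (b_R, s2_R): the restricted slope estimate and the
    residual variance estimate of the restricted model.
    """
    rng = random.Random(seed)
    x2 = [rng.gauss(0.0, 1.0) for _ in range(n)]
    x3 = [0.5 * a + rng.gauss(0.0, 1.0) for a in x2]
    y = [1.0 + 2.0 * a + 3.0 * c + rng.gauss(0.0, 1.0) for a, c in zip(x2, x3)]

    mean_x = sum(x2) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x2)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x2, y))
    b_r = sxy / sxx                      # slope of y on x2
    a_r = mean_y - b_r * mean_x          # intercept
    ssr = sum((b - a_r - b_r * a) ** 2 for a, b in zip(x2, y))
    s2_r = ssr / (n - 2)                 # two estimated parameters
    return b_r, s2_r

if __name__ == "__main__":
    for n in (10, 100, 1000):
        b_r, s2_r = simulate_restricted_ols(n)
        print(f"n = {n:5d}: b_R = {b_r:.3f}, s_R^2 = {s2_r:.3f}")
```

For this particular DGP the probability limits are $\beta_2 + 0.5\beta_3 = 3.5$ for $b_R$ and $\sigma^2 + \beta_3^2\,\mathrm{var}(v) = 10$ for $s_R^2$, so neither estimate settles near the true values 2 and 1 as $n$ grows.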

4.9 (Section 4.2.3)

a. Generate a sample of size 100 from the model
$y_i = 2 + \sqrt{x_i} + \varepsilon_i$, where the $x_i$ are independent and
uniformly distributed on the interval [0, 20] and the $\varepsilon_i$ are
independent and distributed as N(0, 0.01).

b. Consider the non-linear regression model $y = f(x, \beta) + \varepsilon$
with $f(x, \beta) = \beta_1 + \beta_2 x^{\beta_3}$. Determine the
$3 \times 1$ vector of gradients $g = \partial f/\partial\beta$ of this model.

c. Perform twenty steps of the Gauss-Newton method to estimate $\beta$, with
starting values $\beta = (0, 1, 1)'$. Plot the three resulting series of
twenty estimates of $\beta_1$, $\beta_2$, and $\beta_3$.

d. Now take as starting values $\beta = (0, 1, 0)'$. Explain the problems that
arise in this case.

e. With the final estimate in c, perform an F-test of the hypothesis that
$\beta_3 = 1/2$. Perform also an LM-test of this hypothesis.
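
The gradient asked for in part b of Exercise 4.9 can be checked numerically before it is used in the Gauss-Newton steps. The sketch below only covers the gradient and a finite-difference check; it does not implement the Gauss-Newton iterations themselves, and all function names are illustrative.

```python
import math

def f(x, beta):
    """Regression function f(x, beta) = beta1 + beta2 * x**beta3, for x > 0."""
    b1, b2, b3 = beta
    return b1 + b2 * x ** b3

def gradient(x, beta):
    """Analytic gradient of f with respect to (beta1, beta2, beta3), for x > 0."""
    b1, b2, b3 = beta
    return [1.0, x ** b3, b2 * x ** b3 * math.log(x)]

def numerical_gradient(x, beta, h=1e-6):
    """Central finite differences, used only to check the analytic gradient."""
    grads = []
    for j in range(3):
        up = list(beta)
        up[j] += h
        down = list(beta)
        down[j] -= h
        grads.append((f(x, up) - f(x, down)) / (2.0 * h))
    return grads

if __name__ == "__main__":
    print(gradient(4.0, (0.0, 1.0, 1.0)))
    print(gradient(4.0, (0.0, 1.0, 0.0)))  # first two entries coincide
```

One issue that shows up at the starting value $\beta_3 = 0$ of part d: the first two gradient components both equal 1 for every $x$, so the matrix of Gauss-Newton regressors has two identical columns.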

4.10 (Sections 4.2.4, 4.3.7)

Consider the DGP $y_i = 1 + x_i^2 + \omega_i$ with $x_i \sim$ NID(0, 1) and
$\omega_i \sim$ NID(0, 1) and with $(x_1, \ldots, x_n)$ independent of
$(\omega_1, \ldots, \omega_n)$. The estimated model is
$y_i = \beta_1 + \beta_2 x_i + \beta_3 x_i^2 + \varepsilon_i$, that is, with
additional regressor $x_i$. Let $b_2$ denote the least squares estimate of
$\beta_2$ in this model.

a. Generate two samples of this model, one of size n = 10 and another of size
n = 100. Determine $b_2$ and the standard error of $b_2$ for these two
samples.

b. Perform an LM-test for the null hypothesis that $\beta_2 = 0$ against the
alternative that $\beta_2 \neq 0$ for the two samples of a. Use a 5%
significance level (the 5% critical value of the $\chi^2(1)$ distribution is
3.84).

c. Repeat a and b 1000 times, drawing new values for $x_i$ and $\omega_i$ in
each simulation run. Make histograms of the resulting 1000 values of $b_2$ and
of the LM-test, both for n = 10 and for n = 100.

d. What is the standard deviation of the 1000 outcomes of $b_2$ for n = 10?
And for n = 100? How does this compare with the standard errors in a?

e. How many of the 1000 computed LM-values are larger than 3.84 for n = 10?
And for n = 100? Comment on the outcomes.

f. Compute the asymptotic distribution (4.7) for the parameter vector
$\beta = (\beta_1, \beta_2, \beta_3)'$ of this DGP. What approximation does
this provide for the standard error of $b_2$? How does this compare with the
results in d?


4.11 (Sections 4.3.5, 4.4.3)

In this simulation exercise we generate a random sample by means of
$y_i = \mu + \varepsilon_i$, where $\mu = \frac{1}{2}$ and the disturbances
$\varepsilon_i$ are independently and identically distributed with the t(3)
distribution. A researcher who does not know the DGP is interested in testing
the hypothesis that the observations come from a population with mean zero.
This hypothesis is, of course, not correct, as the DGP has mean
$\mu = \frac{1}{2}$.

a. Simulate a set of n = 50 data from this DGP. Make a histogram of the
simulated data set.

b. The researcher tests the null hypothesis of zero mean by means of the
conventional (least squares based) t-test. Perform this test. What is the
computed standard error of the sample mean? What is the true standard
deviation of the sample mean? What is your conclusion?

c. Suppose now that the researcher uses GMM, based on the moment condition
$E[y_i - \mu] = 0$ for $i = 1, \ldots, n$. What is the estimated mean? What is
the corresponding GMM standard error? Give a formal proof, based on (4.67) and
(4.68), of the fact that in the current model the GMM standard error is equal
to the conventional OLS standard error multiplied by the factor
$\sqrt{1 - \frac{1}{n}}$.

d. Now the researcher postulates the Cauchy distribution (that is, the t(1)
distribution) for the disturbances. Using this distribution, compute the
corresponding ML estimate of $\mu$ and perform the Wald test on the hypothesis
that $\mu = 0$. What is the computed standard error of this ML estimator of
$\mu$? Why is this not the true standard error of this estimator?

e. Now suppose that the researcher is lucky enough to postulate the t(3)
distribution for the disturbances. Perform the corresponding Wald test of the
hypothesis that the population mean is zero.

f. Discuss which method (of the ones used in b-e) the researcher would best
use if he or she does not know the DGP and is uncertain about the correct
disturbance distribution.

4.12 (Sections 4.3.2, 4.4.3)

In this exercise we consider a simulated data set of sample size n = 50. The
data were generated by the model

$y_i = 0.5 + x_i + \eta_i$,  $i = 1, \ldots, 50$,

where the regressors $x_i$ are IID with uniform distribution on the interval
$0 < x < 2$ and the $\eta_i$ are IID with t(3) distribution. The estimated
model is

$y_i = \alpha + \beta x_i + \varepsilon_i$,

so that the correct parameter values of the DGP are $\alpha = 0.5$ and
$\beta = 1$. In answering the following questions, give comments on all the
outcomes.

a. Estimate the parameters $\alpha$ and $\beta$ by means of OLS.

b. Make a scatter plot of the data and make a histogram of the OLS residuals
obtained in a.

c. Determine the GMM standard errors of the OLS estimates of $\alpha$ and
$\beta$.

d. Estimate the parameters $\alpha$ and $\beta$ by ML, using the (incorrect)
Cauchy distribution for the disturbances $\varepsilon_i$. The density of the
Cauchy distribution (that is, the t(1) distribution) is
$f(\varepsilon_i) = \frac{1}{\pi(1 + \varepsilon_i^2)}$.

e. Estimate the parameters $\alpha$ and $\beta$ now by ML using the
(incorrect) t(5) distribution with density
$f(\varepsilon_i) \propto \frac{1}{(1 + \frac{1}{5}\varepsilon_i^2)^3}$.

f. Finally estimate the parameters $\alpha$ and $\beta$ by ML using the
correct t(3) distribution with density
$f(\varepsilon_i) \propto \frac{1}{(1 + \frac{1}{3}\varepsilon_i^2)^2}$.

g. Explain why GMM provides no help here to get a clear idea of the slope
parameter $\beta$. Also explain why the (incorrect) ML estimates in d and e
perform quite well in this case. State your overall conclusion for estimating
models for data that are scattered in a way as depicted in b.
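
The factor $\sqrt{1 - 1/n}$ stated in part c of Exercise 4.11 can be checked numerically on any sample before attempting the formal proof. The sketch below is an illustration, not the derivation asked for: the function names, the seed, and the DGP mean of 0.5 in the demo are assumptions, and the ratio of the two standard errors does not depend on them.

```python
import math
import random

def ols_and_gmm_se(y):
    """Return (mean, OLS standard error, GMM standard error) for y_i = mu + e_i."""
    n = len(y)
    mean = sum(y) / n
    ssr = sum((v - mean) ** 2 for v in y)
    ols_se = math.sqrt(ssr / (n - 1) / n)   # s / sqrt(n), with s^2 = SSR / (n - 1)
    gmm_se = math.sqrt(ssr) / n             # sqrt(sum of squared residuals) / n
    return mean, ols_se, gmm_se

def draw_t3(rng):
    """One draw from t(3): a standard normal divided by sqrt(chi2(3) / 3)."""
    z = rng.gauss(0.0, 1.0)
    chi2 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(3))
    return z / math.sqrt(chi2 / 3.0)

if __name__ == "__main__":
    rng = random.Random(123)
    mu = 0.5  # assumed value of the DGP mean, for illustration only
    y = [mu + draw_t3(rng) for _ in range(50)]
    mean, ols_se, gmm_se = ols_and_gmm_se(y)
    print(f"mean = {mean:.3f}, OLS se = {ols_se:.4f}, GMM se = {gmm_se:.4f}")
    print(f"ratio = {gmm_se / ols_se:.6f}, sqrt(1 - 1/n) = {math.sqrt(1 - 1 / 50):.6f}")
```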
4.13 (Section 4.3.8)

Consider the n = 12 data for coffee sales of brand 2 in Section 4.2.5. Let
$y = \log(q)$ denote the logarithm of quantity sold and $x = \log(d)$ the
logarithm of the deal rate. Two econometricians (A and B) estimate different
models for these data, namely,

A: $y = \alpha + \beta x + \varepsilon$,    B: $y = \gamma(1 + \delta x) + \varepsilon$.

The least squares estimates of $\alpha$ and $\beta$ are denoted by $a$ and
$b$, the (non-linear) least squares estimates of $\gamma$ and $\delta$ by $c$
and $d$. In the tests below use a significance level of 5%.

a. Give a mathematical proof of the fact that $c = a$ and $d = b/a$.

b. Perform the two regressions and check that the outcomes satisfy the
relations in a.

c. Test the hypothesis that $\delta = 1$ by a Wald test.

d. Test this hypothesis also by a Lagrange Multiplier test.

e. Test this hypothesis also by a Likelihood Ratio test.

f. Test this hypothesis using the model of econometrician A.

4.14 (Section 4.3.8)

In this exercise we consider the bank wage data and the model discussed before
in Section 3.4.2. Here the logarithm of yearly wage ($y$) is explained in terms
of education ($x_2$), logarithm of begin salary ($x_3$), gender ($x_4$), and
minority ($x_5$), by the model

$y = \beta_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \varepsilon$.

The data set consists of observations for n = 474 individuals. Apart from the
unrestricted model we consider three restricted models, that is,
(i) $\beta_5 = 0$, (ii) $\beta_4 = \beta_5 = 0$, (iii) $\beta_4 + \beta_5 = 0$.
For all tests below, compute the relevant (asymptotic) P-values. It is assumed
that the error terms $\varepsilon$ are NID(0, $\sigma^2$).

a. For each of the four models, compute the SSR and the ML estimate
$s_{ML}^2$ of the disturbance variance.

b. Compute the log-likelihoods (4.30) of the four models. Perform LR-tests for
the three restricted models against the unrestricted model.

c. Perform Wald tests for the three restricted models against the unrestricted
model.

d. Perform also LM-tests for the three restricted models against the
unrestricted model, by means of auxiliary regressions (4.55).

e. Compare the outcomes with the ones obtained in Section 3.4.2.

4.15 (Section 4.3.8)

Use the same bank wage data set as in Exercise 4.14. Now assume that we accept
the hypothesis that $\beta_5 = 0$ and that we wish to test the hypothesis that
$\beta_4 = 0$, given that $\beta_5 = 0$. In the notation of Exercise 4.14, we
test the restricted model (ii) against the alternative 'unrestricted'
model (i). For the tests below, compute the relevant (asymptotic) P-values.

a. Perform conventional t- and F-tests for this hypothesis.

b. Compute also the LR-, W-, and LM-tests for this hypothesis.

c. Use these outcomes to discuss the difference between joint testing of
multiple restrictions (as in Exercise 4.14 with the joint model restriction
(ii) tested against the full model with all five regressors) and sequential
testing of single hypotheses (as in the current exercise). In particular,
consider the differences if one uses a significance level of 2.5% in all tests.

4.16 (Section 4.3.8)

In this exercise we consider the food expenditure data on food consumption
(fc, measured in \$10,000 per year), total consumption (tc, also measured in
\$10,000 per year), and average household size (hs) that were discussed in
Example 4.3 (pp. 204-5). As dependent variable we take $y = fc/tc$, the
fraction of total consumption spent on food, and as explanatory variables we
take (apart from a constant term) $x_2 = tc$ and $x_3 = hs$. The estimated
model is of the form $y_i = f(x_{2i}, x_{3i}, \beta) + \varepsilon_i$, where
$f(x_{2i}, x_{3i}, \beta) = \beta_1 + \beta_2 x_{2i}^{\beta_3} + \beta_4 x_{3i}$.
We will consider three hypotheses for the parameter $\beta_3$, namely,
$\beta_3 = 0$ (so that $x_2$ has no effect on $y$), $\beta_3 = 1$ (so that the
marginal effect of $x_2$ on $y$ is constant), and $\beta_3 = \frac{1}{2}$ (so
that the marginal effect of $x_2$ on $y$ declines for higher values of $x_2$).

a. Exhibit 4.6 shows a scatter diagram of $y$ against $x_2$. Discuss whether
you can get any intuition from this diagram concerning the question which of
the hypotheses $\beta_3 = 0$, $\beta_3 = 1$, and $\beta_3 = \frac{1}{2}$ could
be plausible.

b. For $\beta_3 = 0$ the parameters $(\beta_1, \beta_2, \beta_4)$ of the model
are not identified. Prove this. What reformulation of the restricted model
(for $\beta_3 = 0$) is needed to get identified parameters in this case?

c. Estimate the unrestricted model with four regression parameters. Try out
different starting values and pay attention to the convergence of the
estimates.

d. Test the three hypotheses ($\beta_3 = 0$, $\beta_3 = 1$, and
$\beta_3 = \frac{1}{2}$) by means of t-tests.

e. Test these three hypotheses also by means of F-tests.

f. Now test the three hypotheses by means of LR-tests.

g. Test the three hypotheses by means of LM-tests, using the result (4.21)
with appropriate auxiliary regressions. The regressors in step 2 of the
LM-test consist of the four partial derivatives $\partial f/\partial\beta_j$
for $j = 1, 2, 3, 4$.

h. Test the three hypotheses by means of the Wald test as expressed in (4.48).
Formulate the parameter restriction respectively as
$r(\theta) = \beta_3 = 0$, $r(\theta) = \beta_3 - 1 = 0$, and
$r(\theta) = \beta_3 - \frac{1}{2} = 0$.

i. Test the three hypotheses again by means of the Wald test as expressed in
(4.48), but now with the parameter restriction formulated as respectively
$r(\theta) = \beta_3^2 = 0$, $r(\theta) = \beta_3^2 - 1 = 0$, and
$r(\theta) = \beta_3^2 - \frac{1}{4} = 0$.

j. Compare the outcomes of the foregoing six testing methods (in d-i) for the
three hypotheses on $\beta_3$. Comment on the similarities and differences of
the test outcomes.

4.17 (Sections 4.3.3, 4.4.4, 4.4.5)

In this exercise we consider the stock market returns data for the sector of
non-cyclical consumer goods in the UK. The model is
$y_i = \alpha + \beta x_i + \varepsilon_i$, where $y_i$ are the excess returns
in this sector and $x_i$ are the market excess returns. The monthly data are
given for 1980-99, giving n = 240 observations. The disturbances
$\varepsilon_i$ are assumed to be IID distributed, either with normal
distribution N(0, $\sigma^2$) or with the Cauchy distribution with density
$f(\varepsilon_i) = (\pi(1 + \varepsilon_i^2))^{-1}$.

a. Determine the log-likelihood for the case of Cauchy disturbances. Show that
the ML estimates for $\alpha$ and $\beta$ are obtained from the two conditions
$\sum \varepsilon_i (1 + \varepsilon_i^2)^{-1} = 0$ and
$\sum \varepsilon_i x_i (1 + \varepsilon_i^2)^{-1} = 0$.

b. Estimate $\alpha$ and $\beta$ by ML, based on the Cauchy distribution.
Determine also the (asymptotic) standard errors of these estimates.

c. Estimate $\alpha$ and $\beta$ by ML, based on the normal distribution.
Compute also the standard errors of these estimates. Compare the results with
those obtained by OLS.

d. Test the hypothesis that $\alpha = 0$ using the results in b. Test this
hypothesis also using the results in c. Use a 5% significance level.

e. Answer the questions in d also for the hypothesis that $\beta = 1$. Again
use a 5% significance level.

f. Determine the two histograms of the residuals corresponding to the
estimates in b and c. On the basis of this information, which of the two
estimation methods do you prefer? Motivate your answer.

g. Compute GMM standard errors of the estimates in c, that is, the estimates
based on the two moment conditions $E[\varepsilon_i] = 0$ and
$E[\varepsilon_i x_i] = 0$. How does this compare with the (ordinary) standard
errors computed in c? Does this alter your answers in d and e to test
respectively whether $\alpha = 0$ and $\beta = 1$?

h*. Finally, consider the QML estimates based on the two Cauchy moment
conditions defined by $E[\varepsilon_i/(1 + \varepsilon_i^2)] = 0$ and
$E[\varepsilon_i x_i/(1 + \varepsilon_i^2)] = 0$. Determine the GMM standard
errors of these estimates and perform the two tests of d and e. Compare the
outcomes (standard errors and test results) with the outcomes in b, d, and e
based on the Cauchy ML standard errors.
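
As a starting point for part a of Exercise 4.17, the Cauchy log-likelihood and its first order conditions can be written out as below. This is a sketch that follows directly from the Cauchy density given in the exercise; the factor 2 in the derivatives can be dropped when setting them to zero.

```latex
\begin{aligned}
\log L(\alpha, \beta) &= -n \log \pi - \sum_{i=1}^{n} \log\!\left(1 + \varepsilon_i^2\right),
  \qquad \varepsilon_i = y_i - \alpha - \beta x_i, \\
\frac{\partial \log L}{\partial \alpha} &= \sum_{i=1}^{n} \frac{2\varepsilon_i}{1 + \varepsilon_i^2} = 0, \\
\frac{\partial \log L}{\partial \beta} &= \sum_{i=1}^{n} \frac{2\varepsilon_i x_i}{1 + \varepsilon_i^2} = 0.
\end{aligned}
```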
Diagnostic Tests and Model Adjustments


In this chapter we describe methods to test the assumptions of the regression
model. If some of the assumptions are not satisfied then there are several
ways to proceed. One option is to use least squares and to derive the
properties of this estimator under more general conditions. Another option is
to adjust the specification of the model, for instance, by changing the
included variables, the functional form, or the probability distribution of
the disturbance terms. We discuss alternative model specifications, including
non-linear models, disturbances that are heteroskedastic or serially
correlated, and the use of instrumental variables.

Most of the sections of this chapter can be read independently of each
other. We refer to Exhibit 0.3 (p. 8) for the sections of this chapter that are
needed for selected topics in Chapters 6 and 7.

5.1 Introduction


Modelling  in  practice 

It  is  the  skill  of  econometricians  to  use  economic  theory  and  statistical  data  in 
order  to  construct  econometric  models  that  provide  an  adequate  summary  of 
the available information. In most situations the relevant theoretical
information is of a qualitative nature, suggesting which economic variables play a
role  and  perhaps  whether  variables  are  positively  or  negatively  related.  Most 
models  from  economic  theory  describe  a  part  of  the  economy  in  isolation 
from  its  environment  (the  ceteris  paribus  assumption).  This  means  that  the 
empirical  modeller  is  faced  with  the  following  two  questions.  How  should 
the  relationships  between  the  variables  of  interest  be  specified,  and  how 
should  the  other  influences  be  taken  into  account? 

In practice it often occurs that an initially chosen econometric model does
not fit the data well. This may happen despite genuine efforts to use
economic theory and to collect data that are relevant for the investigation
at hand. The model may turn out to be weak, for instance, because important
aspects of the data are left unexplained or because some of the basic
assumptions underlying the econometric model are violated. Examples of the
latter are that the residuals may be far from normal or that the parameter
estimates may differ substantially in subsamples. If the model is not correctly
specified, there are various avenues to take, depending on the degree of belief
one has in the employed model structure and in the observed data. In this book
we describe econometric modelling from an applied point of view where we start
from the data. We consider models as constructs that we can change in the
light of the data information. By incorporating more of the relevant data
characteristics in the model, we may improve our understanding of the
underlying economic processes. The selection and adjustment of models are
guided by our insight into the relevant economic and business phenomena. As
economic theory does not often suggest explicit models, this leaves some
freedom to choose the model specification. Several diagnostic tests have been
developed that help to get clear ideas about which features of the model need
improvement.

This view on econometric modelling differs from a more traditional one
that has more confidence in the theory and the postulated model and less in
the observed data. In this view econometrics is concerned with
the measurement of theoretical relations as suggested by economic theory.
In our approach, on the other hand, we are not primarily interested in testing
a particular theory but in using data to get a better understanding of an
observed phenomenon of interest. The major role of tests is then to find out
whether the chosen model is able to represent the main characteristics of
interest of the data.

Diagnostic  tests 

In econometrics we use empirical data to improve our understanding of
economic processes. The regression model,


$y = X\beta + \varepsilon$,


discussed in Chapter 3 is one of the standard tools of analysis. This is a nice
tool as it is simple to apply and it gives reliable information if the
assumptions of Chapter 3 are satisfied. Several tests are available to test
whether the proposed model is correctly specified. Such tests of the underlying
model assumptions are called misspecification tests. Because the purpose of the
analysis is to make a diagnosis of the quality of the model, this is also called
diagnostic testing. Like a medical doctor, the econometrician tries to detect
possible weaknesses of the model, to diagnose possible causes, and to propose
treatments (model adjustments) to end up with a 'healthy' model. Such
a model is characterized by the fact that it provides insight into the problem
at hand and that it shows acceptable reactions to relevant diagnostic tests.

The regression model $y = X\beta + \varepsilon$ was analysed in Chapter 3 under
the seven assumptions stated in Section 3.1.4 (pp. 125-6). All these
assumptions will be subjected to diagnostic tests in this chapter. In
Section 5.2 we test the specification of the functional form, that is, the
number of included explanatory variables in $X$ and the way they enter into the
model (Assumptions 2 and 6). Section 5.3 considers the possibility of
non-constant parameters $\beta$ (Assumptions 5 and 2). Next we examine the
assumptions on the disturbance terms $\varepsilon$ and we discuss alternative
estimation methods in the case of heteroskedasticity (Assumption 3, in
Section 5.4), serial correlation (Assumption 4, in Section 5.5), and non-normal
distributions (Assumption 7, in Section 5.6). Finally, in Section 5.7 we
consider models with endogenous regressors in $X$, in which case the
orthogonality condition of Section 4.1.3 (p. 194) is violated.

The  empirical  cycle  in  model  construction 

In practice, econometric models are formed in a sequence of steps. First one
selects the relevant data, specifies an initial model, and chooses an estimation
method. The resulting estimated model is subjected to diagnostic tests. The
test outcomes can help to make better choices for the model and the estimation
method (and sometimes for the data). The new model is again subjected
to diagnostic tests, and this process is repeated until the final model is
satisfactory. This process of iterative model specification and testing is
called the empirical cycle (see Exhibit 5.1).

Exhibit 5.1 The empirical cycle in econometric modelling

This sequential method of model construction has implications for the
interpretation of test outcomes. Tests are usually performed under the
assumption that the model has been correctly specified. For instance, the
computed standard errors of estimated coefficients and their P-values depend
on this assumption. In practice, in initial rounds of the empirical cycle we
may work with first-guess models that are not appropriately specified. This
may lead, for instance, to underestimation of the standard errors. Even in this
situation diagnostic tests remain helpful tools to find suitable models.
However, one should not report P-values without providing the details of the
search process that has led to the finally chosen model.

At this point we mention one diagnostic tool that is of particular
importance, namely, the evaluation of the predictive quality of proposed
models. It is advisable to exclude a part of the observed data in the process
of model construction. The excluded data are called the hold-out sample. It is
then possible to investigate whether the final model that is obtained in the
empirical cycle is able to predict the outcomes in this hold-out sample. This
provides a clear test of model quality, irrespective of the way the model has
been obtained. Forecast evaluation as a diagnostic tool will be further
discussed in Sections 5.2.1 (p. 280) and 7.2.4 (p. 570).
5.2.1 The number of explanatory variables

How  many  variables  should  be  included? 

Assume that a set of explanatory variables has been selected as possible
determinants of the variable y. Even if one is interested in the effect of only
one of these explanatory variables (say, \(x_2\)), it is important not to
exclude the other variables a priori. The reason is that variation in the
other variables may cause variation in y, and, if these variables
are excluded from the model, then all the variation in y will be attributed to
that one variable alone. On the other hand, the list of possibly influential
variables may be very long. If all these variables are included, it may be
impossible to estimate the model (if the number of parameters becomes
larger than the number of observations) or the estimates may become very
inefficient (owing to a lack of degrees of freedom if there are insufficient
observations available). The question then is how many variables to include
in the model.

Suppose that we want to estimate the effects of a set of \((k - g)\) variables \(X_1\)
on the dependent variable y, and that in addition another set of g variables
\(X_2\) is available that possibly also influence the dependent variable y. The
effects of \(X_1\) on y can be estimated in the model \(y = X_1\beta_1 + \varepsilon\), with estimator
\(b_R = (X_1'X_1)^{-1}X_1'y\). An alternative is to include the variables \(X_2\) and to
perform a regression in the model \(y = X_1\beta_1 + X_2\beta_2 + \varepsilon\), with corresponding
estimators \((b_1, b_2)\) of \((\beta_1, \beta_2)\). Which estimator of \(\beta_1\) should be preferred, \(b_1\)
or \(b_R\)?


Trade-off  between  bias  and  efficiency 

The answer to the above question is easy if \(\beta_2 = 0\). In Section 3.2.4 (p. 144)
we showed that the inclusion of irrelevant variables leads to a loss in
efficiency. More precisely, under Assumptions 1-6 and with \(\beta_2 = 0\) there
holds \(E[b_1] = E[b_R] = \beta_1\), so that both estimators are unbiased, and
\(\mathrm{var}(b_1) \geq \mathrm{var}(b_R)\) (in the sense that \(\mathrm{var}(b_1) - \mathrm{var}(b_R)\) is positive semidefinite). On the
other hand, if \(\beta_2 \neq 0\), then the situation is more complicated. In Section 3.2.3
(p. 142-3) we showed that by deleting variables we obtain an estimator that
is in general biased (so that \(E[b_R] \neq \beta_1\)), but that it has a smaller variance
than the unbiased estimator \(b_1\) (so that \(\mathrm{var}(b_R) \leq \mathrm{var}(b_1)\)). The question then
becomes whether the gain in efficiency is large enough to justify the bias that
results from deleting \(X_2\). The fact that restrictions improve the efficiency is
one of the main motivations for modelling, but of course the restrictions
should not introduce too much bias. If many observations are available,
then it is better to start with a model that includes all variables that
are economically meaningful, as deleting variables gives only a small gain
in efficiency.


A  prediction  criterion  and  relation  with  the  F-test 

A possible criterion to find a trade-off between bias and variance is the mean
squared error (MSE) of an estimator \(\hat\beta\) of \(\beta\), defined by

\[ \mathrm{MSE}(\hat\beta) = E[(\hat\beta - \beta)(\hat\beta - \beta)'] = \mathrm{var}(\hat\beta) + (E[\hat\beta] - \beta)(E[\hat\beta] - \beta)'. \]

If \(\beta\) contains more than one component then the MSE is a matrix, and
the last equality follows by using the definition of the variance \(\mathrm{var}(\hat\beta) =
E[(\hat\beta - E[\hat\beta])(\hat\beta - E[\hat\beta])']\). A scalar criterion could be obtained by taking the trace
of the MSE matrix. However, as the magnitude of the individual parameters \(\beta_j\)
depends on the scales of measurement of the individual explanatory variables \(x_j\),
this addition of squared errors \((\hat\beta_j - \beta_j)^2\) does not make much sense in general.
Instead we consider the accuracy of the prediction \(\hat{y} = X\hat\beta\) of the vector of mean
values \(E[y] = X\beta\), as the prediction error \(X\hat\beta - X\beta\) does not depend on the scales
of measurement. The total mean squared prediction error (TMSP) is defined as the
sum of the squared prediction errors \((\hat{y}_i - E[y_i])^2\); that is,

\[ \mathrm{TMSP}(\hat\beta) = E[(X\hat\beta - X\beta)'(X\hat\beta - X\beta)]. \]

We can apply this criterion to compare the predictions \(\hat{y} = X_1b_1 + X_2b_2\)
of the larger model with the predictions \(\hat{y}_R = X_1b_R\) of the smaller model. It is
left as an exercise (see Exercise 5.2) to show that \(\mathrm{TMSP}(b_R) \leq \mathrm{TMSP}(b_1)\) if and
only if

\[ \beta_2'V_2^{-1}\beta_2 \leq g, \]

where \(V_2\) is the covariance matrix of \(b_2\) and g is the number of components of \(\beta_2\).
So the restricted estimator \(b_R\) has a smaller TMSP than the unbiased estimator \(b_1\)
if \(\beta_2\) is sufficiently small and/or the variance \(V_2\) of the estimator \(b_2\) is sufficiently
large. In such a situation it is also intuitively evident that it is better to reduce the
uncertainty by eliminating the variables \(X_2\) from the model. In practice, \(\beta_2\) and \(V_2\)
are of course unknown. We can replace \(\beta_2\) and \(V_2\) by their least squares estimates
in the model \(y = X_1\beta_1 + X_2\beta_2 + \varepsilon\). That is, \(\beta_2\) is replaced by \(b_2\) and \(V_2\) by
\(\hat{V}_2 = s^2(X_2'M_1X_2)^{-1}\), where \(s^2\) is the estimated error variance in the unrestricted
model and \(M_1 = I - X_1(X_1'X_1)^{-1}X_1'\) (see Section 3.4.1 (p. 161)). We can prefer to
delete the variables \(X_2\) from the model if

\[ b_2'\hat{V}_2^{-1}b_2/g = b_2'X_2'M_1X_2b_2/(gs^2) \leq 1. \]

According to the result (3.49) in Section 3.4.1, this is equivalent to the F-test for
the null hypothesis that \(\beta_2 = 0\) with a critical value of 1. This F-test can also be
written as

\[ F = \frac{(e_R'e_R - e'e)/g}{e'e/(n - k)}, \qquad (5.1) \]

where \(e_R\) and \(e\) are the residuals of the restricted and unrestricted model,
respectively. The critical value of 1 corresponds to a size of more than 5 per cent; that is,
the TMSP criterion used in this way is more liberal in accepting additional
regressors.
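
The decision rule above can be illustrated numerically. The following is a minimal sketch in Python with NumPy, using simulated data that are purely hypothetical (they are not the data of this book); the function names `ols` and `f_test` are illustrative. It computes the F-statistic (5.1) from the restricted and unrestricted residuals, and also from the quadratic form in \(b_2\), so that the equivalence of the two expressions can be checked.

```python
import numpy as np


def ols(X, y):
    """Return the OLS coefficients and residuals of a regression of y on X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b, y - X @ b


def f_test(X1, X2, y):
    """F-statistic (5.1) for H0: beta_2 = 0, computed in two equivalent ways."""
    n, g = X2.shape
    k = X1.shape[1] + g
    _, e_r = ols(X1, y)                        # restricted model (X2 deleted)
    b, e = ols(np.hstack([X1, X2]), y)         # unrestricted model
    s2 = (e @ e) / (n - k)                     # estimated error variance
    f_ssr = ((e_r @ e_r - e @ e) / g) / s2     # residual form (5.1)

    # Quadratic form b2' X2' M1 X2 b2 / (g s^2), with M1 = I - X1 (X1'X1)^{-1} X1'
    b2 = b[X1.shape[1]:]
    M1 = np.eye(n) - X1 @ np.linalg.solve(X1.T @ X1, X1.T)
    f_quad = (b2 @ X2.T @ M1 @ X2 @ b2) / (g * s2)
    return f_ssr, f_quad


# Simulated data (hypothetical): X2 has only a weak effect on y.
rng = np.random.default_rng(0)
n = 200
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
X2 = rng.normal(size=(n, 2))
y = X1 @ np.array([1.0, 0.5]) + X2 @ np.array([0.05, 0.0]) + rng.normal(size=n)

f_ssr, f_quad = f_test(X1, X2, y)
delete_X2 = f_ssr <= 1.0    # TMSP rule: delete X2 when the F-statistic is at most 1
print(f"F = {f_ssr:.4f}, delete X2: {delete_X2}")
```

Both expressions should agree up to rounding error; which model the rule selects depends on the simulated sample.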


The  information  criteria  of  Akaike  and  Schwarz 

Another method to decide whether the variables \(X_2\) should be included in the
model or not is to use information criteria that express the model fit and the
number of parameters in a single criterion. The Akaike information criterion
(AIC) and Schwarz information criterion (SIC) (also called the Bayes information
criterion or BIC) are defined as follows, where p is the number of
included regressors and \(s_p^2\) is the maximum likelihood estimator of the error
variance in the model with p regressors:

\[ \mathrm{AIC}(p) = \log(s_p^2) + \frac{2p}{n}, \]

\[ \mathrm{SIC}(p) = \log(s_p^2) + \frac{p\log(n)}{n}. \]

These criteria involve a penalty term for the number of parameters, to
account for the fact that the model fit always increases (that is, \(s_p^2\) decreases)
if more explanatory variables are included. The unrestricted model has
\(p = k\), and the restricted model obtained by deleting \(X_2\) has \(p = k - g\).
The model with the smallest value of AIC or SIC is chosen. For \(n \geq 8\), the SIC
imposes a stronger penalty on extra variables than AIC, so that SIC is more
inclined to choose the smaller model than AIC.

For the linear regression model, the information criteria are related to the
F-test (5.1). For large enough sample size n, the comparison of AIC values
corresponds to an F-test with critical value 2 and SIC corresponds to an
F-test with critical value \(\log(n)\) (see Exercise 5.2). For instance, the restricted
model is preferred above the unrestricted model by AIC, in the sense that
\(\mathrm{AIC}(k - g) < \mathrm{AIC}(k)\), if the F-test in (5.1) is smaller than 2.
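
As an illustration, the two criteria and their link with the F-test can be sketched in a few lines of Python. The residual sums of squares below are hypothetical numbers chosen for the example, and the function names are illustrative. The helper `aic_critical_value` uses the fact that \(\mathrm{AIC}(k-g) < \mathrm{AIC}(k)\) holds exactly when \(e_R'e_R/e'e < \exp(2g/n)\), which can be rewritten as \(F < \frac{n-k}{g}(\exp(2g/n) - 1)\); this bound tends to 2 as n grows.

```python
import math


def aic(ssr, n, p):
    """AIC(p) = log(s_p^2) + 2p/n, with s_p^2 = SSR/n the ML variance estimate."""
    return math.log(ssr / n) + 2 * p / n


def sic(ssr, n, p):
    """SIC(p) = log(s_p^2) + p*log(n)/n."""
    return math.log(ssr / n) + p * math.log(n) / n


def aic_critical_value(n, k, g):
    """Exact F-value below which AIC prefers the restricted model."""
    return (n - k) / g * (math.exp(2 * g / n) - 1)


# Hypothetical residual sums of squares for n = 100, k = 5, g = 2.
n, k, g = 100, 5, 2
ssr_u, ssr_r = 50.0, 51.0
F = ((ssr_r - ssr_u) / g) / (ssr_u / (n - k))      # F-statistic (5.1)

prefer_restricted_aic = aic(ssr_r, n, k - g) < aic(ssr_u, n, k)
prefer_restricted_sic = sic(ssr_r, n, k - g) < sic(ssr_u, n, k)
print(F, aic_critical_value(n, k, g), prefer_restricted_aic, prefer_restricted_sic)
```

In this example the F-statistic is below the AIC bound, so both criteria select the restricted model.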


Criteria  based  on  out-of-sample  predictions 

Another useful method for model selection is to compare the predictive
performance of the models. For this purpose the data set is split in two
parts, an ‘estimation sample’ (used to construct the model) and a ‘prediction
sample’ or ‘hold-out sample’ for predictive evaluation. So the models are
estimated using only the data in the first subsample, and the estimated
models are then used to predict the y-values in the prediction sample.
Possible evaluation criteria are the root mean squared error (RMSE) and
the mean absolute error (MAE). These are defined by

\[ \mathrm{RMSE} = \left( \frac{1}{n_f} \sum_{i=1}^{n_f} (y_i - \hat{y}_i)^2 \right)^{1/2}, \]

\[ \mathrm{MAE} = \frac{1}{n_f} \sum_{i=1}^{n_f} |y_i - \hat{y}_i|, \]

where \(n_f\) denotes the number of observations in the prediction sample and \(\hat{y}_i\)
denotes the predicted values.
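
The two forecast criteria are straightforward to compute. The sketch below uses plain Python and a small set of made-up hold-out values and forecasts (hypothetical numbers, not taken from the bank wage data).

```python
import math


def rmse(actual, predicted):
    """Root mean squared error over the prediction sample."""
    n_f = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n_f)


def mae(actual, predicted):
    """Mean absolute error over the prediction sample."""
    n_f = len(actual)
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / n_f


# Hypothetical hold-out observations and model forecasts.
y_holdout = [10.2, 10.5, 10.9, 11.1]
y_forecast = [10.0, 10.6, 10.7, 11.4]

print(rmse(y_holdout, y_forecast), mae(y_holdout, y_forecast))
```

The RMSE is never smaller than the MAE, and it penalizes large forecast errors more heavily.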


Iterative  variable  selection  methods 

In the foregoing we assumed that the \((k - g)\) variables in \(X_1\) should all be
included in the model and that the g variables in \(X_2\) should either all
be included or all be deleted. But how should we choose g and the decomposition
of the variables in the two groups \(X_1\) and \(X_2\)? We assume that the k
regressors can be ordered in decreasing importance; that is, if the jth
regressor is included in the model then the regressors \(1, 2, \ldots, j - 1\) are also
included. It then remains to choose the number of regressors \((k - g)\) to be
included in the model. This can be done, for instance, by choosing the model
with the smallest value of TMSP, AIC, or SIC. Another method is to perform
a sequence of t-tests.

In the bottom-up approach one starts with the smallest model (including
only the constant term, corresponding to \(g = k - 1\)) and tests \(H_0: \beta_2 = 0\)
against \(H_1: \beta_2 \neq 0\) (in the model with \(g = k - 2\)). If this hypothesis is
rejected, then the second regressor is included in the model and one tests
\(H_0: \beta_3 = 0\) against \(H_1: \beta_3 \neq 0\) (in the model with \(g = k - 3\)). Variables are
added in this way until the next regressor is no longer significant. This is
also called the specific-to-general method and it is often applied in practice,
as it starts from simple models.

In the top-down approach one starts with the largest model (with \(g = 0\))
and tests \(H_0: \beta_k = 0\) against \(H_1: \beta_k \neq 0\). If this hypothesis is not rejected,
then one tests \(H_0: \beta_k = \beta_{k-1} = 0\) and so on. Variables are deleted until the
next regressor becomes significant. This is also called the general-to-specific
method and it has the attractive statistical property that all tests are
performed in correctly specified models. In contrast, in the specific-to-general
approach, the initial small models are in general misspecified, as they will
exclude relevant regressors.

Variants of this approach can also be applied if the regressors cannot be
ordered in decreasing importance. The method of backward elimination
starts with the full model (with \(g = 0\)) and deletes the variable that is least
significant. In the second step, the model with the remaining \(k - 1\) regressors
is estimated and again the least significant variable is deleted. This is repeated
until all remaining regressors are significant. The method of forward selection
starts with the smallest model (that includes only the constant term, with
\(g = k - 1\)). Then the variable is added that has the (in absolute sense) largest
t-value (this involves \(k - 1\) regressions in models that contain a constant and
one other regressor). This is repeated until none of the additional regressors is
significant anymore.
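
Backward elimination can be sketched as a short loop. The Python code below is a minimal illustration under stated assumptions: the data are simulated (hypothetical), the constant term is always kept, and the fixed threshold `t_crit = 1.96` is used as an approximate two-sided 5 per cent critical value rather than the exact t-quantile. The function names are illustrative.

```python
import numpy as np


def t_values(X, y):
    """OLS t-values b_j / se(b_j) for a regression of y on X."""
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    b = xtx_inv @ X.T @ y
    e = y - X @ b
    s2 = (e @ e) / (n - k)                      # estimated error variance
    return b / np.sqrt(s2 * np.diag(xtx_inv))


def backward_elimination(X, y, names, t_crit=1.96):
    """Delete the least significant regressor until all remaining ones are significant.

    Column 0 of X is assumed to be the constant term and is never deleted.
    """
    keep = list(range(X.shape[1]))
    while len(keep) > 1:
        t_abs = np.abs(t_values(X[:, keep], y))
        j = 1 + int(np.argmin(t_abs[1:]))       # least significant non-constant regressor
        if t_abs[j] >= t_crit:
            break                               # all remaining regressors are significant
        del keep[j]
    return [names[i] for i in keep]


# Simulated data (hypothetical): only x1 affects y.
rng = np.random.default_rng(1)
n = 300
x1, x2, x3 = rng.normal(size=(3, n))
X = np.column_stack([np.ones(n), x1, x2, x3])
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

selected = backward_elimination(X, y, ["const", "x1", "x2", "x3"])
print(selected)
```

Forward selection follows the same pattern in the opposite direction: at each step the candidate regressor with the largest absolute t-value is added, as long as it is significant.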

Example  5.1:  Bank  Wages  (continued) 

As an illustration we consider again the data on wages and education of 474
employees of a US bank that were analysed in foregoing chapters. The
relation between education and wage may be non-linear because the marginal
returns of schooling may depend on the attained level of education. We
will discuss (i) the data and possible non-linearities in the wage equation, (ii) a
class of polynomial models, (iii) selection of the degree of the polynomial by
means of different selection criteria, and (iv) a forecast evaluation of the
models.

(i) The data and possible non-linearities in the wage equation

The dependent variable (y) is the logarithm of yearly wage, and as regressors
we take the variables education (x, the number of years of education), gender
(\(D_g = 0\) for females, \(D_g = 1\) for males), and minority (\(D_m = 0\) for
non-minorities, \(D_m = 1\) for minorities). Exhibit 5.2 (a) shows the partial regression
scatter plot of wage against education (after regressions on a constant
and the variables \(D_g\) and \(D_m\)). This plot indicates the possibility of a
non-linear relation between education and wage.

(ii)  Polynomial  models  for  the  wage  equation 

One  method  to  incorporate  non-linearities  is  to  consider  polynomial  models 
of  the  form 

\[ y = \alpha + \gamma D_g + \mu D_m + \beta_1 x + \beta_2 x^2 + \cdots + \beta_p x^p + \varepsilon. \]

The constant term and the variables \(D_g\) and \(D_m\) are included in all models,
and the question is how many powers of x to include in the model. These
variables are ordered in a natural way; that is, if \(x^p\) is included in the model,
then \(x^j\) is also included for all \(j < p\). For evaluation purposes we leave out the
fifty observations corresponding to employees with the highest education
(\(x \geq 17\)). The remaining 424 observations (with \(x \leq 16\)) are used to estimate
models with different values of p.

(iii)  Selection  of  the  degree  of  the  polynomial  model 

Exhibit 5.2 (b) and (c) show plots for p = 1 of the residuals against x and of
the value of y against the fitted value \(\hat{y}\). Both plots indicate some
non-linearities. Exhibit 5.2 (d) and (e) show the same two plots for the model
with p = 2. There are fewer indications of remaining non-linearities in this
case. Exhibit 5.3 summarizes the outcomes of the models with degrees
p = 1, 2, 3, 4. If we use the adjusted \(R^2\) as criterion, then p = 4 is optimal.
If we use the t-test on the highest included power of x (‘bottom up’), then this
would suggest taking p = 3 (for a significance level of 5 per cent). If we
perform F-tests on the significance of the highest powers in the model with
p = 4 (‘top down’), then p = 3 is again preferred. The AIC and SIC criteria
also prefer the model with p = 3.

(iv)  Forecast  evaluation  of  the  models 

Although  the  foregoing  results  could  suggest  selecting  the  degree  of  the 
polynomial  model  as  p  =  3  or  p  =  4,  the  models  with  p  =  1  and  p  =  2 
provide  much  better  forecasts.  The  model  with  degree  p  =  2  is  optimal 
from  this  perspective. 

This is also illustrated by the graphs in Exhibit 5.2 (f)-(i), which show
that for p = 3 and p = 4 the forecasted wages of the fifty employees with
the highest education are larger than the actual wages. This means that the
models with larger values of p do not reflect systematic properties of the
wage-education relation for higher levels of education.

[Exhibit 5.2, panels (a)-(e): scatter plots; axis labels EDUC and LOGSALFIT]

Exhibit 5.2 Bank Wages (Example 5.1)

(a): partial regression scatter plot of wage (in logarithms) against education (after regressions
on a constant and the variables ‘gender’ and ‘minority’, 474 employees). (b) and (c): scatter
diagrams of residuals against education (b) and of wage against fitted values for the (linear)
model with p = 1 (c) using data of 424 employees with EDUC ≤ 16. (d) and (e): two similar
scatter diagrams for the (quadratic) model with p = 2.

[Exhibit 5.2, panels (f)-(i): scatter plots; axis labels FORECAST and LOGSAL]

Exhibit 5.2 (Contd.)

Scatter diagrams of actual wages against forecasted wages (both in logarithms) for fifty
employees with highest education (≥ 17 years), based on polynomial models with different
values of p (p = 1, 2, 3, 4).

Criterion                        p = 1      p = 2      p = 3      p = 4
Adjusted R2                      0.4221     0.4804     0.5620     0.5628*
P-values ‘bottom up’ t-test      0.0000     0.0000     0.0000*    0.1808
P-values ‘top down’ F-test       0.0000     0.0000     0.0000*    0.1808
AIC                             -0.0400    -0.1440    -0.3125*   -0.3121
SIC                             -0.0019    -0.0963    -0.2552*   -0.2452
RMSE of forecasts                0.4598     0.2965*    2.7060     7.3530
MAE of forecasts                 0.4066     0.2380*    2.3269     5.9842

Exhibit 5.3 Bank Wages (Example 5.1)

Model selection criteria applied to wage data; an * indicates the optimal degree of the
polynomial model for each criterion.

Exercises:  T:  5.1,  5.2a-d 


5.2.2  Non-linear  functional  forms 

A  general  misspecification  test:  RESET 

In the foregoing chapters the functional relation between the dependent
variable and the explanatory variables was assumed to be known up to a
set of parameters to be estimated. The linear model is given by

\[ y_i = x_i'\beta + \varepsilon_i = \beta_1 + \sum_{j=2}^{k} \beta_j x_{ji} + \varepsilon_i. \qquad (5.2) \]

Instead of this linear relation, it may be that the dependent variable depends in
a non-linear way on the explanatory variables. To test this, we can, for
example, add quadratic and cross product terms to obtain the model

\[ y_i = \beta_1 + \sum_{j=2}^{k} \beta_j x_{ji} + \sum_{j=2}^{k} \gamma_{jj} x_{ji}^2 + \sum_{j=2}^{k} \sum_{h=j+1}^{k} \gamma_{jh} x_{ji} x_{hi} + \varepsilon_i. \]

A test for non-linearity is given by the F-test for the \(\tfrac{1}{2}k(k - 1)\) restrictions that
all parameters \(\gamma_{jh}\) are zero. This may be impractical if k is not small. A simpler
test is to add a single squared term to the linear model (5.2), for example \(\hat{y}_i^2\),
where \(\hat{y}_i = x_i'b\) with b the OLS estimator in (5.2). This gives the test equation

\[ y_i = x_i'\beta + \gamma\hat{y}_i^2 + \varepsilon_i. \qquad (5.3) \]

Under the null hypothesis of a correct linear specification in (5.2) there holds
\(\gamma = 0\), which can be tested by the t-test in (5.3). This is called the regression
specification error test (RESET) of Ramsey. As b depends on y, this means
that the added regressor \(\hat{y}_i^2\) in (5.3) is stochastic. Therefore the t-test is valid
only asymptotically under the assumptions stated in Section 4.1.4 (p. 197).
To allow for higher order non-linearities we can include higher order terms in
the RESET; that is,

\[ y_i = x_i'\beta + \sum_{j=1}^{p} \gamma_j (\hat{y}_i)^{j+1} + \varepsilon_i. \qquad (5.4) \]

The hypothesis that the linear model is correctly specified then corresponds
to the \(F(p, n - k - p)\) test on the joint significance of the parameters
\(\gamma_1, \ldots, \gamma_p\).
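
The RESET can be sketched as two least squares regressions followed by an F-statistic of the form (5.1). The Python code below is a minimal illustration with NumPy on simulated (hypothetical) data in which the true relation is quadratic, so that the linear model is misspecified by construction; the function names are illustrative.

```python
import numpy as np


def fit(X, y):
    """Return the residual sum of squares and fitted values of an OLS regression."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ b
    e = y - y_hat
    return e @ e, y_hat


def reset_f(X, y, p=1):
    """RESET F-statistic for the test equation (5.4) with powers 2, ..., p + 1."""
    n, k = X.shape
    ssr_r, y_hat = fit(X, y)                                  # linear model (5.2)
    powers = np.column_stack([y_hat ** (j + 1) for j in range(1, p + 1)])
    ssr_u, _ = fit(np.hstack([X, powers]), y)                 # test equation (5.4)
    return ((ssr_r - ssr_u) / p) / (ssr_u / (n - k - p))


# Simulated data (hypothetical): y is quadratic in x, but a linear model is fitted.
rng = np.random.default_rng(2)
n = 200
x = rng.uniform(0.0, 4.0, size=n)
y = 1.0 + x + 0.5 * x**2 + rng.normal(scale=0.5, size=n)
X = np.column_stack([np.ones(n), x])

F1 = reset_f(X, y, p=1)
F2 = reset_f(X, y, p=2)
print(F1, F2)
```

Large values of the statistic, compared with the \(F(p, n - k - p)\) distribution, lead to rejection of the linear specification.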

Some  meaningful  non-linear  specifications 

The  RESET  is  a  misspecification  test.  That  is,  it  tests  the  null  hypothesis  of 
correct  specification,  but  if  the  null  hypothesis  is  rejected  it  does  not  tell  us 
how  to  adjust  the  functional  form.  If  possible,  the  choice  of  an  alternative 
model  should  be  inspired  by  economic  insight. 

In the linear model (5.2), the marginal effects of the explanatory variables
on the dependent variable are constant; that is, \(\partial y/\partial x_j = \beta_j\) for \(j = 2, \ldots, k\).
Alternative models can be obtained by assuming different forms for these
marginal effects. We discuss some possible models, and for simplicity we
assume that k = 3.

It may be that the marginal effect depends on the level of \(x_2\), say
\(\partial y/\partial x_2 = \beta_2 + \gamma_2 x_2\). This can be modelled by including the squared term \(x_2^2\) in the
model so that

\[ y_i = \beta_1 + \beta_2 x_{2i} + \tfrac{1}{2}\gamma_2 x_{2i}^2 + \beta_3 x_{3i} + \varepsilon_i. \]

It may also be that the marginal effect depends on the level of another
variable, say \(\partial y/\partial x_2 = \beta_2 + \gamma_3 x_3\). This can be modelled by including the product
term \(x_2x_3\) in the model, so that

\[ y_i = \beta_1 + \beta_2 x_{2i} + \beta_3 x_{3i} + \gamma_3 x_{2i}x_{3i} + \varepsilon_i. \]

The term \(x_{2i}x_{3i}\) is called an interaction term. The above two specifications
provide non-linear functional forms with a clear interpretation. As these
models remain linear in the unknown parameters, they can be estimated by
(linear) least squares. Other methods to deal with non-linearities are to use
non-parametric techniques, to transform the data, or to use varying parameters.
This is discussed in Sections 5.2.3, 5.2.4, and 5.3.2 respectively.

Example  5.2:  Bank  Wages  (continued) 

We  consider  again  the  wage  data  discussed  in  Example  5.1,  with  education 
(x), gender (\(D_g\)), and minority (\(D_m\)) as explanatory variables. The linear
model is given by

\[ y = \alpha + \gamma D_g + \mu D_m + \beta x + \varepsilon. \]

As was discussed in Example 5.1, the wage equation may be non-linear. We
will now discuss (i) tests on non-linearities, (ii) a non-linear model with
non-constant marginal returns to schooling, and (iii) the results of this model.
(i) Tests on non-linearities

Recall that y is the logarithm of yearly wage S, so that \(\beta = \partial y/\partial x =
\partial \log(S)/\partial x = (\partial S/\partial x)/S\) measures the relative wage increase due to
an additional year of education. The above linear model assumes that this
effect of education is constant for all employees. The results of two RESETs
(with p = 1 and with p = 2 in (5.4)) are in Panels 2 and 3 of Exhibit 5.4. Both
tests indicate that the linear model is misspecified. Note that in the model
with p = 2, the two terms \(\hat{y}_i^2\) and \(\hat{y}_i^3\) are individually not significant but
jointly they are highly significant. This is because of multicollinearity, as the
terms \(\hat{y}_i^2\) and \(\hat{y}_i^3\) have a correlation coefficient of 0.999871. The reason for
this high correlation is that the logarithmic salaries \(y_i\) vary only between 9.66
and 10.81 in the sample (corresponding to salaries ranging from $15,750 to
$135,000).

Panel 1: Dependent Variable: LOGSALARY
Method: Least Squares
Sample: 1 474
Included observations: 474

Variable         Coefficient    Std. Error    t-Statistic    Prob.
C                   9.199980      0.058687       156.7634    0.0000
EDUC                0.077366      0.004436       17.44229    0.0000
GENDER              0.261131      0.025511       10.23594    0.0000
MINORITY           -0.132673      0.028946      -4.583411    0.0000

Panel 2: Ramsey RESET Test:
F-statistic             77.60463    Probability    0.000000
Log likelihood ratio    72.58029    Probability    0.000000

Test Equation: Dependent Variable: LOGSALARY
Method: Least Squares
Sample: 1 474
Included observations: 474

Variable         Coefficient    Std. Error    t-Statistic    Prob.
C                  -69.82447      8.970686      -7.783627    0.0000
EDUC               -1.443306      0.172669      -8.358791    0.0000
GENDER             -4.877462      0.583791      -8.354812    0.0000
MINORITY            2.488307      0.298731       8.329595    0.0000
FITTED^2            0.947902      0.107602       8.809349    0.0000

Panel 3: Ramsey RESET Test:
F-statistic             40.23766    Probability    0.000000
Log likelihood ratio    75.21147    Probability    0.000000

Test Equation: Dependent Variable: LOGSALARY
Method: Least Squares
Sample: 1 474
Included observations: 474

Variable         Coefficient    Std. Error    t-Statistic    Prob.
C                   827.2571      555.8566       1.488256    0.1374
EDUC                10.63188      7.483135       1.420779    0.1560
GENDER              35.89400      25.26657       1.420612    0.1561
MINORITY           -18.22389      12.83565      -1.419787    0.1563
FITTED^2           -14.11083      9.330216      -1.512380    0.1311
FITTED^3            0.483936      0.299821       1.614082    0.1072

Exhibit 5.4 Bank Wages (Example 5.2)

Panel 1: regression of wage (in logarithms) on education, gender, and minority. Panels 2 and 3:
RESET, respectively with p = 1 and with p = 2.

Panel 4: Dependent Variable: LOGSALARY
Method: Least Squares
Sample: 1 474
Included observations: 474

Variable         Coefficient    Std. Error    t-Statistic    Prob.
C                   10.72379      0.182800       58.66402    0.0000
GENDER              0.234323      0.123432       1.898401    0.0583
MINORITY            0.315020      0.136128       2.314151    0.0211
EDUC               -0.171086      0.028163      -6.074851    0.0000
EDUC^2              0.009736      0.001117       8.717483    0.0000
GENDER*EDUC        -0.002213      0.009350      -0.236632    0.8130
MINORITY*EDUC      -0.032525      0.010277      -3.164785    0.0017

Panel 5: Dependent Variable: LOGSALARY
Method: Least Squares
Sample: 1 474
Included observations: 474

Variable         Coefficient    Std. Error    t-Statistic    Prob.
C                   10.72135      0.182324       58.80378    0.0000
GENDER              0.205648      0.023451       8.769373    0.0000
MINORITY            0.322841      0.131921       2.447231    0.0148
EDUC               -0.169464      0.027289      -6.209968    0.0000
EDUC^2              0.009624      0.001010       9.529420    0.0000
MINORITY*EDUC      -0.033074      0.010002      -3.306733    0.0010

Exhibit 5.4 (Contd.)

Non-linear models with quadratic term for education and with interaction terms for
education with gender and minority (Panel 4) and with the insignificant interaction term
GENDER*EDUC omitted (Panel 5).

(ii)  A  non-linear  model 

As a possible alternative model we investigate whether the marginal returns
of schooling depend on the level of (previous) education and on the variables
gender and minority; that is,

\[ \frac{\partial y}{\partial x} = \beta_1 + 2\beta_2 x + \beta_3 D_g + \beta_4 D_m. \]

This motivates a model with quadratic term and interaction effects

\[ y = \alpha + \gamma D_g + \mu D_m + \beta_1 x + \beta_2 x^2 + \beta_3 D_g x + \beta_4 D_m x + \varepsilon. \]
(iii) Results of the non-linear model

The estimated model is in Panel 4 of Exhibit 5.4. The regression coefficient \(b_3\)
is not significant. The estimated model obtained after deleting the regressor
\(D_g x\) is given in Panel 5 of Exhibit 5.4. The marginal returns of schooling are
estimated as

\[ \frac{\partial y}{\partial x} = b_1 + 2b_2 x + b_4 D_m = -0.169 + 0.019x - 0.033D_m. \]

For instance, for an education level of x = 16 years an additional year
of education gives an estimated wage increase of 13.8 per cent for
non-minorities and of 10.5 per cent for minorities.
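
The reported percentages can be checked directly from the coefficients in Panel 5 of Exhibit 5.4. The short Python sketch below evaluates the estimated marginal return at x = 16; the function name `marginal_return` is illustrative.

```python
# Coefficients of EDUC, EDUC^2 and MINORITY*EDUC from Panel 5 of Exhibit 5.4.
b1, b2, b4 = -0.169464, 0.009624, -0.033074


def marginal_return(x, minority):
    """Estimated marginal return of schooling b1 + 2*b2*x + b4*Dm."""
    return b1 + 2 * b2 * x + b4 * minority


non_minority = marginal_return(16, 0)   # approximately 0.1385
minority = marginal_return(16, 1)       # approximately 0.1054
print(round(non_minority, 4), round(minority, 4))
```

These values agree with the 13.8 and 10.5 per cent mentioned in the text, up to rounding of the coefficients.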


5.2.3  Non-parametric  estimation 
Non-parametric  model  formulation 

The  methods  discussed  in  the  foregoing  section  to  deal  with  non-linear 
functional  forms  require  that  the  non-linearity  is  explicitly  modelled  in 
terms  of  a  limited  number  of  variables  (such  as  squared  terms  and  interaction 
terms)  and  their  associated  parameters.  Such  methods  are  called  parametric, 
as  the  non-linearity  is  modelled  in  terms  of  a  limited  number  of  parameters. 
Non-linearity can also be modelled in a more flexible way, by means of
so-called non-parametric models. In this section we will discuss the main ideas
by considering the situation of a scatter of points \((x_i, y_i)\), \(i = 1, \ldots, n\). Instead
of the simple linear regression model that requires a linear dependence in the
sense that \(y = \alpha + \beta x + \varepsilon\), it is assumed that

\[ y = f(x) + \varepsilon, \]

where the function f is unknown. In particular, it may be non-linear in the
explanatory variable x. It is assumed that \(E[\varepsilon] = 0\), or, in the case where the
regressor x is stochastic, that \(E[\varepsilon|x] = 0\). This means that the (parametric)
assumption of the linear regression model that \(E[y|x] = \alpha + \beta x\) is replaced by
the (non-parametric) assumption that \(E[y|x] = f(x)\). That is, f(x) can be
interpreted as the expectation of y for a given value of x.

Local  regression  with  nearest  neighbour  fit 

We now describe a procedure called local regression to estimate the function
f(x). This estimation method is called local because the function values f(x)
are estimated locally, for (a large number of) fixed values of x. We describe
the estimation of \(f(x_0)\) at a given point \(x_0\). The function f(x) can then be
estimated by repeating the procedure for a grid of values of x. It is assumed
that the function f is smooth, in particular, differentiable at \(x_0\). This implies
that, locally around \(x_0\), the function f can be approximated by a linear
function; that is,

\[ f(x) \approx \alpha_0 + \beta_0(x - x_0), \]

where \(\alpha_0 = f(x_0)\) and \(\beta_0\) is the derivative of the function f at \(x_0\). The basic
idea of local regression is to use the observations \((x_i, y_i)\) with \(x_i\)-values that
are close enough to \(x_0\) to estimate the parameters \(\alpha_0\) and \(\beta_0\) in the model

\[ y_i = \alpha_0 + \beta_0(x_i - x_0) + \omega_i. \]

As the linear function is only an approximation, we denote the error by a new
disturbance term \(\omega_i\). If we consider a point \(x_0\) that is present in the observed
data set (say, for observation \(i_0\), so that \(x_{i_0} = x_0\)), then

\[ E[y_{i_0}] = f(x_{i_0}) = \alpha_0. \]

That is, in this case the estimate of the constant term \(\alpha_0\) can be interpreted as
an estimate of the function value \(f(x_{i_0})\).

The linear approximation is more accurate for values of \(x_i\) that are closer
to \(x_0\), and this motivates the use of larger weights for such observations.
Therefore, instead of estimating \(\alpha_0\) and \(\beta_0\) by ordinary least squares, the
parameters are estimated by minimizing the weighted sum of squares

\[ \sum_{i} w_i \left( y_i - \alpha_0 - \beta_0(x_i - x_0) \right)^2. \]

This is called weighted least squares. In particular, we can exclude observations
with values of \(x_i\) that are too far away from \(x_0\) (and that include no
reliable information on \(f(x_0)\) anymore) by choosing weights with \(w_i = 0\) if
\(|x_i - x_0|\) is larger than a certain threshold value. In this case only the
observations for which \(x_i\) lies in some sufficiently close neighbourhood of \(x_0\) are
included in the regression. This is, therefore, called a regression with nearest
neighbour fit.

Choice  of  neighbourhood 

To apply the above local regression method, we have to choose which
observations are included in the regression (that is, the considered
neighbourhood of $x_0$) and the weights of the included observations. We will
discuss a method for choosing neighbourhoods and weights that is much
applied in practice.

The neighbourhood can be chosen by selecting the bandwidth span, also
called the span. This is a number $0 < s < 1$ representing the fraction of the $n$
observations $(x_i, y_i)$ that are included in the regression to estimate $\alpha_0$ and $\beta_0$.
The selected observations are the ones that are closest to $x_0$, that is, the $sn$
nearest neighbours of $x_0$, and the other $(1 - s)n$ observations with largest
values of $|x_i - x_0|$ are excluded from the regression. One usually chooses the
bandwidth span around $s = 0.6$ or $s = 0.7$. Smaller values may lead to
estimated curves that are overly erratic, whereas larger values may lead to
very smooth curves that miss relevant aspects of the function $f$. It is often
instructive to try out some values for the bandwidth span (for instance,
$s = 0.3$, $s = 0.6$, and $s = 0.9$) and then to decide which estimated curve has
the best interpretation.

Choice  of  weights 

After the selection of the relevant neighbourhood of $x_0$, the next step is to
select the weights of the included observations. These weights decrease for
observations with a larger distance between $x_i$ and $x_0$. Let $D$ be the maximal
distance $|x_i - x_0|$ that occurs for the $sn$ included observations, and let
$d_i = |x_i - x_0|/D$ be the scaled distance of the $i$th observation from $x_0$ (so
that $0 \le d_i \le 1$ for all included observations). A popular weighting function
is the so-called tricube weighting function, defined by

$$w_i = (1 - d_i^3)^3 \quad \text{for } 0 \le d_i \le 1.$$

The largest weight is given when $d_i = 0$ (that is, when $x_i = x_0$) and the
weights gradually decrease to zero as $d_i$ tends to 1 (the upper bound).
The graph of the tricube function is shown in Exhibit 5.5.
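The neighbourhood and weighting rules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `tricube_weights` is hypothetical, and rounding $sn$ up to the next integer (with at least two points) is an assumption, since the text does not say how a non-integer $sn$ is handled.

```python
import math

def tricube_weights(x, x0, span):
    """Tricube weights for the span*n nearest neighbours of x0 (zero elsewhere)."""
    n = len(x)
    # Number of included observations; rounding up is an assumption of this sketch.
    m = min(n, max(2, math.ceil(span * n)))
    dist = [abs(xi - x0) for xi in x]
    D = sorted(dist)[m - 1]          # maximal distance among the included points
    weights = []
    for d in dist:
        if d > D:
            weights.append(0.0)      # excluded observation
        elif D == 0:
            weights.append(1.0)      # degenerate case: all neighbours coincide with x0
        else:
            weights.append((1.0 - (d / D) ** 3) ** 3)
    return weights

x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
w = tricube_weights(x, x0=4, span=0.5)
print(w)
```

Note that the observation at the maximal distance $D$ has scaled distance 1 and therefore receives weight zero, so it is included only in a formal sense.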


The tricube weighting function $w = (1 - d^3)^3$ for $0 \le d \le 1$.

Some extensions

The tricube function is only one out of a number of possible weighting functions
that are used in practice. In most cases the choice of the bandwidth span is crucial,
whereas the estimates for a given bandwidth span do not depend much on the
chosen weights. Note that the weights of the tricube function will in general not
add up to unity, and the same holds true for other weighting functions. This is not
important, as the choice of scaled weights (with weights $w_i / \sum_j w_j$, where the sum
runs over the $sn$ included observations) gives the same estimates of $\alpha_0$ and $\beta_0$.
Since the weights need not add up to unity, the weighting functions are often called
kernel functions.
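The claim that rescaling the weights leaves the estimates unchanged can be checked numerically. The sketch below solves the weighted least squares normal equations for a constant and one regressor in closed form; the helper name `wls_line` and the small data set are hypothetical and serve only as an illustration.

```python
def wls_line(u, y, w):
    """Weighted least squares of y on a constant and u; returns (a0, b0)."""
    sw = sum(w)
    su = sum(wi * ui for wi, ui in zip(w, u))
    suu = sum(wi * ui * ui for wi, ui in zip(w, u))
    sy = sum(wi * yi for wi, yi in zip(w, y))
    suy = sum(wi * ui * yi for wi, ui, yi in zip(w, u, y))
    det = sw * suu - su * su                 # determinant of the 2x2 normal equations
    b0 = (sw * suy - su * sy) / det
    a0 = (sy - b0 * su) / sw
    return a0, b0

# Hypothetical data: u = x - x0 already scaled so that the largest distance is 1.
u = [-1.0, -0.5, 0.0, 0.5, 1.0]
y = [0.2, 0.9, 1.1, 1.8, 3.1]
w = [(1.0 - abs(ui) ** 3) ** 3 for ui in u]  # tricube weights
w_scaled = [wi / sum(w) for wi in w]         # same weights, scaled to add up to one

raw = wls_line(u, y, w)
scaled = wls_line(u, y, w_scaled)
print(raw, scaled)
```

Multiplying all weights by a common positive constant multiplies the criterion by that constant, so the minimizing values of $\alpha_0$ and $\beta_0$ are the same.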

The local linear specification $y_i = \alpha_0 + \beta_0(x_i - x_0) + \omega_i$ is recommended in most
cases, but sometimes one uses regressions with only a constant term

$$y_i = \alpha_0 + \omega_i,$$

or regressions with a second degree polynomial

$$y_i = \alpha_0 + \beta_0 (x_i - x_0) + \gamma_0 (x_i - x_0)^2 + \omega_i.$$

The version with only the constant term was the first one that was developed and
is usually called the kernel method. It has the disadvantage that it leads to biased
estimates near the left and right end of the curve, whereas the local linear
regression method is much less affected by this boundary bias.

Local regression is most often used to draw a smooth curve through a
two-dimensional scatter plot. It is, however, also possible to use it with $k$ regressors,
but it is less easy to get a good graphical feeling for the obtained estimates.


Summary  of  local  linear  regression 

To estimate a non-linear curve $y = f(x)$ from a scatter of points $(x_i, y_i)$,
$i = 1, \ldots, n$, by means of local linear regression, one takes the following
steps.


Local  regression 

• Step 1: Choice of grid of points. Choose a grid of points for the variable $x$
where the function $f(x)$ will be estimated. If the number of observations is
not too large, one can take all the $n$ observed values $x_i$; otherwise one can
estimate $f(x)$ only for a selected subsample of these values.

• Step 2: Choice of bandwidth span. Choose the fraction $s$ (with $0 < s < 1$) of
the observations to be included in each local regression. A usual choice is
$s = 0.6$, but it is advisable to try out some other values as well.

• Step 3: Choice of weighting function. Choose the weights $w_i$ to be used in
weighted least squares. A possible choice is the tricube weighting function.


• Step 4: Perform weighted linear regressions. For each point $x_0$ in the grid of
points chosen in step 1, perform a weighted linear regression by minimizing
the weighted sum of squares $\sum_i w_i (y_i - \alpha_0 - \beta_0 (x_i - x_0))^2$. Here the
summation runs over the $sn$ included points of step 3.

• Step 5: Estimated non-linear function $y = f(x)$. For given value of $x_0$,
estimate the function value $f(x_0)$ by $\hat{f}(x_0) = \hat{\alpha}_0$, with $\hat{\alpha}_0$ the estimated
constant term in step 4. The estimated function can be visualized by
means of a scatter plot of $\hat{f}(x_i)$ against $x_i$ for the grid of points of step 1,
and a continuous curve is obtained by interpolating between the points
$(x_i, \hat{f}(x_i))$ in this scatter.
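The five steps can be put together in a minimal Python sketch. It is an illustration under stated assumptions rather than a reference implementation: the function names are hypothetical, $sn$ is rounded up to an integer, and the fallback to a weighted mean when the local regression is degenerate is a choice made here, not something prescribed by the text.

```python
import math

def local_linear_fit(x, y, x0, span=0.6):
    """Estimate f(x0) as the constant term of a tricube-weighted local linear regression."""
    n = len(x)
    m = min(n, max(2, math.ceil(span * n)))      # step 2: number of nearest neighbours
    dist = [abs(xi - x0) for xi in x]
    D = sorted(dist)[m - 1]                      # maximal distance in the neighbourhood
    sw = su = suu = sy = suy = 0.0
    for xi, yi, di in zip(x, y, dist):
        if di > D:
            continue                             # observation outside the neighbourhood
        w = 1.0 if D == 0 else (1.0 - (di / D) ** 3) ** 3   # step 3: tricube weight
        u = xi - x0
        sw += w
        su += w * u
        suu += w * u * u
        sy += w * yi
        suy += w * u * yi
    if sw == 0.0:
        return float("nan")                      # no observation with positive weight
    det = sw * suu - su * su                     # step 4: 2x2 normal equations
    if det <= 1e-12 * sw * suu:
        return sy / sw                           # assumed fallback: weighted mean
    b0 = (sw * suy - su * sy) / det
    return (sy - b0 * su) / sw                   # step 5: f_hat(x0) = alpha_0_hat

def local_regression(x, y, grid, span=0.6):
    """Step 1 supplies the grid; the fitted values are returned in grid order."""
    return [local_linear_fit(x, y, x0, span) for x0 in grid]

# Exactly linear data are reproduced by a local linear fit.
x = [float(i) for i in range(20)]
y = [2.0 + 3.0 * xi for xi in x]
fitted = local_regression(x, y, grid=x, span=0.6)
print(fitted[:3])
```

Because the local model contains a constant and a slope, data that lie exactly on a straight line are fitted without error for any span, which gives a simple check on an implementation.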


Example  5.3:  Simulated  Data  from  a  Non-linear  Model 

To illustrate the idea of local regression we first apply the method to a set of
simulated data. We simulate a set of $n = 200$ data from the data generating
process $y_i = \sin(x_i) + \varepsilon_i$, where the $x_i$ consist of a random sample from the
uniform distribution on the interval $0 \le x_i \le 2.5$ and the $\varepsilon_i$ are a random
sample from the normal distribution with mean zero and standard deviation
$\sigma = 0.2$, with $x_i$ and $\varepsilon_j$ independent for all $i, j = 1, \ldots, 200$.

Exhibit  5.6  shows  the  scatter  of  the  generated  data  (in  (a))  as  well  as  four 
curves  —  namely,  of  the  data  generating  process  in  (b)  and  of  three  curves 
that  are  estimated  by  local  linear  regression  with  three  choices  for  the 
bandwidth  span,  s  =  0.3  in  (c),  s  =  0.6  in  (d),  and  s  =  0.9  in  (e).  For 
s  =  0.9  the  fitted  curve  is  very  smooth,  but  it  underestimates  the  decline  of 
the  curve  at  the  right  end.  For  s  =  0.3  this  decline  is  picked  up  well,  but  the 
curve  shows  some  erratic  movements  that  do  not  correspond  to  properties  of 
the  data  generating  process.  The  curve  obtained  for  s  =  0.6  provides  a 
reasonable  compromise  between  smoothness  and  sensitivity  to  fluctuations 
that  are  present  in  the  functional  relationship. 
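A simulation along the lines of this example can be sketched as follows. The seed, the function name `local_linear_fit`, and the rounding of $sn$ are assumptions of this sketch, so the generated data differ from those behind Exhibit 5.6 and the numbers are not meant to reproduce the exhibit.

```python
import math
import random

def local_linear_fit(x, y, x0, span):
    """Tricube-weighted local linear estimate of f(x0) with nearest neighbour fit."""
    n = len(x)
    m = min(n, max(2, math.ceil(span * n)))
    dist = [abs(xi - x0) for xi in x]
    D = sorted(dist)[m - 1]
    sw = su = suu = sy = suy = 0.0
    for xi, yi, di in zip(x, y, dist):
        if di > D:
            continue
        w = 1.0 if D == 0 else (1.0 - (di / D) ** 3) ** 3
        u = xi - x0
        sw += w
        su += w * u
        suu += w * u * u
        sy += w * yi
        suy += w * u * yi
    det = sw * suu - su * su
    if det <= 1e-12 * sw * suu:
        return sy / sw                      # assumed fallback for degenerate cases
    b0 = (sw * suy - su * sy) / det
    return (sy - b0 * su) / sw

rng = random.Random(0)                      # arbitrary seed, chosen for this sketch only
n = 200
x = [rng.uniform(0.0, 2.5) for _ in range(n)]
y = [math.sin(xi) + rng.gauss(0.0, 0.2) for xi in x]

mean_abs_error = {}
for span in (0.3, 0.6, 0.9):
    fitted = [local_linear_fit(x, y, x0, span) for x0 in x]
    # Average distance between the fitted curve and the true DGP curve sin(x).
    mean_abs_error[span] = sum(abs(f - math.sin(x0)) for f, x0 in zip(fitted, x)) / n
    print(span, round(mean_abs_error[span], 3))
```

Plotting the fitted values against `x` for the three spans gives curves that can be compared with the data generating process in the same way as in the exhibit.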

Example  5.4:  Bank  Wages  (continued) 

As  a  second  illustration  we  consider  the  relation  between  education  and 
wages  in  the  banking  sector.  In  Example  5.1  we  found  evidence  for  possible 
non-linearities  in  this  relation.  We  can  also  investigate  this  by  a  local  linear 
regression  of  wage  on  education  (for  simplicity  we  exclude  other  explanatory 
variables  gender  and  minority). 

Exhibit 5.7 (a) shows the scatter of the $n = 474$ data, together with four
fitted curves in (b-e). The relation does not seem to be linear, and the returns
to education seem to become larger for higher levels of education. For this
data set the local linear regression with bandwidth span $s = 0.9$ seems to be
preferable, as it gives nearly the same results as $s = 0.6$ but without the small
irregularities that do not seem to have a clear interpretation.

Exhibit 5.6 Simulated Data from a Non-linear Model (Example 5.3)

Simulated data with local linear regression based on nearest neighbour fit with span 0.6 (a),
DGP curve (b), and three local linear regression curves with spans 0.3 (c), 0.6 (d), and 0.9 (e).

Exhibit 5.7 Bank Wages (Example 5.4)

Scatter diagram of salary (in logarithms) against education with linear fit and with local
linear fit with span 0.9 (a) and four fitted curves, linear (b) and local linear with spans 0.3
(c), 0.6 (d), and 0.9 (e).

5.2.4 Data transformations

Data  should  be  measured  on  compatible  scales 

If diagnostic tests indicate misspecification of the model, one can consider
transformations of the data to obtain a better specification. In every empirical
investigation, one of the first questions to be answered concerns the most
appropriate form of the data to be used in the econometric model.

For linear models (5.2) the scaling of the variables is not of intrinsic
importance, as was discussed at the end of Section 3.1.3 (p. 124-5), although
for the computation of the inverse of $X'X$ in $b = (X'X)^{-1}X'y$, it is preferable
that all explanatory variables are roughly of the same order of magnitude.
What is more important is that the additive structure of the model implies
that the variables should be incorporated in a compatible manner. For
example, it makes sense to relate the price of one stock to the price of another
stock, and also to relate the respective returns, but it makes less sense to
relate the price of one stock to the returns of another stock. It also makes
sense to relate the output of a firm to labour and capital, or to relate the
logarithms of these variables, but it makes less sense to relate the logarithm
of output to the levels of labour and capital. Of all the possible data
transformations we discuss two important ones, taking logarithms and taking
differences.

Use  and  interpretation  of  taking  logarithms  of  observed  data 

The logarithmic transformation is useful for several reasons. Of course, it can
only be applied if all variables take on only positive values, but this is the case
for many economic variables. For instance, if the dependence of the dependent
variable on the explanatory variable is multiplicative of the form
$y_i = \alpha_1 x_i^{\alpha_2} e^{\varepsilon_i}$, then

$$\log(y_i) = \beta_1 + \beta_2 \log(x_i) + \varepsilon_i \qquad (5.5)$$

with $\beta_1 = \log(\alpha_1)$ and $\beta_2 = \alpha_2$. This so-called log-linear specification is
of interest because the coefficient $\beta_2$ is the elasticity of $y$ with respect to
$x$, that is,

$$\beta_2 = \frac{d\log(y_i)}{d\log(x_i)} = \frac{dy_i}{dx_i}\,\frac{x_i}{y_i}.$$

It is often more plausible that economic agents show constant reactions to
relative changes in variables like prices and income than to absolute changes.
Further, the logarithmic transformation may reduce skewness and
heteroskedasticity. To illustrate this idea, consider the model (5.5), where $\varepsilon_i$ is
normally distributed. Then $\log(y_i)$ is normally distributed with mean
$\mu_i = \beta_1 + \beta_2 \log(x_i)$ and variance $\sigma^2$, and the original variable $y_i$ is
log-normally distributed with median $e^{\mu_i}$, mean $E[y_i] = e^{\mu_i + \frac{1}{2}\sigma^2}$, and variance
$\mathrm{var}(y_i) = (E[y_i])^2 (e^{\sigma^2} - 1)$ (see Exercise 5.2). This means that the distribution
of $y_i$ is (positively) skewed and that the standard deviation is proportional to
the level. These are very common properties of economic data and then the
logarithmic transformation of the data may reduce the skewness and
heteroskedasticity.
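The log-normal formulas above can be evaluated directly to see why the standard deviation is proportional to the level. The sketch below uses hypothetical values of $\mu_i$ and $\sigma$; the function name `lognormal_moments` is an illustration only.

```python
import math

def lognormal_moments(mu, sigma):
    """Median, mean and variance of y when log(y) is normal with mean mu and variance sigma**2."""
    median = math.exp(mu)
    mean = math.exp(mu + 0.5 * sigma ** 2)
    variance = mean ** 2 * (math.exp(sigma ** 2) - 1.0)
    return median, mean, variance

sigma = 0.5                                   # hypothetical value
ratios = []
for mu in (1.0, 3.0):                         # two hypothetical levels of mu_i
    median, mean, variance = lognormal_moments(mu, sigma)
    ratios.append(math.sqrt(variance) / mean)
    print(mu, median, mean, variance)

# The ratio of standard deviation to mean does not depend on mu.
print(ratios)
```

The ratio equals $\sqrt{e^{\sigma^2} - 1}$ for every value of $\mu_i$, and the mean exceeds the median, which is the positive skewness mentioned in the text.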

Taking  differences  of  observed  data 

Many economic time series show a trending pattern. In such cases, the
statistical assumptions of Chapters 3 and 4 may fail to hold. For instance,
Assumption 1* of Section 4.1.2 (p. 193) requires the regressors to be stable in
the sense that $\mathrm{plim}\left(\frac{1}{n}\sum_{i=1}^{n} x_{ji}^2\right)$ exists and is finite for all explanatory variables
$x_j$. In the case of a linear deterministic trend, say $x_{2i} = i$ for $i = 1, \ldots, n$, the
sequence $\frac{1}{n}\sum_{i=1}^{n} x_{2i}^2$ diverges. To apply conventional tests, the variables should
be transformed to get stable regressors. The trend in a variable $y$ can often be
removed by taking first differences. This operation is denoted by $\Delta$, which is
defined by

$$\Delta y_i = y_i - y_{i-1}.$$

For instance, if $x_{2i} = i$, then $\Delta x_{2i} = 1$, which is a stable regressor.
A combination of the two foregoing transformations is also of interest.
Because

$$\Delta \log(y_i) = \log\left(\frac{y_i}{y_{i-1}}\right) = \log\left(1 + \frac{\Delta y_i}{y_{i-1}}\right) \approx \frac{\Delta y_i}{y_{i-1}}$$

for $\Delta y_i / y_{i-1}$ sufficiently small, this transforms the original level variables $y_i$
into growth rates. The modelling of trends and the question whether variables
should be differenced or not is further discussed in Chapter 7.
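Both transformations are easy to try out on small artificial series. In the sketch below the helper `first_difference` and the 2 per cent growth series are hypothetical examples, not data from the book.

```python
import math

def first_difference(series):
    """Return the differences y_i - y_{i-1} for i = 2, ..., n."""
    return [b - a for a, b in zip(series, series[1:])]

# A linear deterministic trend x_{2i} = i has constant first differences.
trend = list(range(1, 11))
print(first_difference(trend))

# A series growing by 2 per cent per period (hypothetical).
levels = [100.0 * 1.02 ** t for t in range(10)]
log_diff = first_difference([math.log(v) for v in levels])
growth = [d / prev for d, prev in zip(first_difference(levels), levels)]
print(log_diff[0], growth[0])
```

For a growth rate of 0.02 the log difference is $\log(1.02)$, which is close to 0.0198, so the approximation error is small; it becomes larger as the growth rate increases.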


The  Box-Cox  transformation 

If one doubts whether the variables should be included in levels or in logarithms,
one can consider the more general Box-Cox transformation given by

$$y_i(\lambda) = \beta_1 + \sum_{j=2}^{k} \beta_j x_{ji}(\lambda) + \varepsilon_i, \qquad (5.6)$$

where $y_i(\lambda) = (y_i^{\lambda} - 1)/\lambda$ and $x_{ji}(\lambda)$ is defined in a similar way. If $\lambda = 1$, this
corresponds to a linear model, and, as $y_i(\lambda) \to \log(y_i)$ for $\lambda \to 0$, the log-linear
model is obtained for $\lambda = 0$. The elasticity of $y$ with respect to $x_j$ in (5.6) is given by
$\beta_j x_j^{\lambda} / y^{\lambda}$. To estimate the parameters of the model (5.6) we assume that the
disturbance terms $\varepsilon_i$ satisfy Assumptions 2-4 and 7. Then the logarithm of the
joint density function is given by


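$$\log(p(\varepsilon_1, \ldots, \varepsilon_n)) = \sum_{i=1}^{n} \log(p(\varepsilon_i)) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n} \varepsilon_i^2.$$

To obtain the likelihood function, we use that $\varepsilon_i = y_i(\lambda) - \beta_1 - \sum_{j=2}^{k} \beta_j x_{ji}(\lambda)$,
so that $d\varepsilon_i/dy_i = y_i^{\lambda - 1}$. The Jacobian corresponding to the transformation of
$(\varepsilon_1, \ldots, \varepsilon_n)$ to $(y_1, \ldots, y_n)$ is therefore equal to $\prod_{i=1}^{n} y_i^{\lambda - 1}$ (see also (1.19)
in Section 1.2.2 (p. 27)). The log-likelihood is equal to $l(\beta_1, \ldots, \beta_k, \lambda, \sigma^2) =
\log(p(y_1, \ldots, y_n)) = (\lambda - 1)\sum_{i=1}^{n} \log(y_i) + \log(p(\varepsilon_1, \ldots, \varepsilon_n))$, so that

$$l = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) + (\lambda - 1)\sum_{i=1}^{n} \log(y_i) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left( y_i(\lambda) - \beta_1 - \sum_{j=2}^{k} \beta_j x_{ji}(\lambda) \right)^2. \qquad (5.7)$$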

The ML estimates of the parameters are obtained by maximizing this function.
Note that this differs from non-linear least squares in (5.6), as in the minimization
of $\sum \varepsilon_i^2$ the term $(\lambda - 1)\sum \log(y_i)$ in (5.7) would be neglected. Actually, NLS in
(5.6) to estimate the parameters makes no sense. For instance, if the values of the
variables satisfy $y_i > 1$ and $x_{ji} > 1$ for all $i = 1, \ldots, n$ and all $j = 1, \ldots, k$, then
$\sum \varepsilon_i^2 \to 0$ by taking $\lambda \to -\infty$.
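The transformation, the log-likelihood (5.7), and the degeneracy of NLS can be illustrated with a short sketch. The function names and the three data values are hypothetical; `fitted` stands for the regression part $\beta_1 + \sum_j \beta_j x_{ji}(\lambda)$, which is taken as given here rather than estimated.

```python
import math

def box_cox(v, lam):
    """Box-Cox transform (v**lam - 1)/lam, with the logarithm as the limit for lam = 0."""
    if abs(lam) < 1e-12:
        return math.log(v)
    return (v ** lam - 1.0) / lam

def box_cox_loglik(y, fitted, lam, sigma2):
    """Log-likelihood (5.7) for given fitted values of the transformed dependent variable."""
    n = len(y)
    rss = sum((box_cox(yi, lam) - fi) ** 2 for yi, fi in zip(y, fitted))
    jacobian = (lam - 1.0) * sum(math.log(yi) for yi in y)   # term that NLS neglects
    return (-0.5 * n * math.log(2.0 * math.pi) - 0.5 * n * math.log(sigma2)
            + jacobian - rss / (2.0 * sigma2))

y = [2.0, 3.0, 5.0]                          # hypothetical values, all larger than 1

# With all coefficients set to zero, the sum of squares shrinks as lambda decreases.
rss_by_lambda = {lam: sum(box_cox(yi, lam) ** 2 for yi in y)
                 for lam in (-1.0, -10.0, -100.0)}
print(rss_by_lambda)

# The Jacobian term in the log-likelihood penalizes such values of lambda.
for lam in (-1.0, -10.0, -100.0):
    print(lam, box_cox_loglik(y, [0.0, 0.0, 0.0], lam, sigma2=1.0))
```

The sum of squares can be driven towards zero simply by making $\lambda$ very negative, without the model explaining anything, which is why the Jacobian term in (5.7) is needed.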

Tests for a linear model ($\lambda = 1$) or a log-linear model ($\lambda = 0$) can be based on
(5.7), for instance, by using the LR test.

Example 5.5: Bank Wages (continued)

We consider once again the bank wage data and investigate the best way to
include the dependent variable, the salary of US bank employees, in the
model. Until now we have chosen to take the logarithm of salary as the
dependent variable, but there are alternatives. We will discuss (i) the choice
between salaries in levels or logarithms, (ii) a test of linearity and
log-linearity, and (iii) the results and interpretation of an alternative relation.

(i)  Choice  between  levels  and  logarithms 

Exhibit 5.8 (a) and (b) show histograms of the salary ($S$) (in dollars per year)
and of the natural logarithm of salary ($y = \log(S)$) of the 474 employees of
the considered bank. The distribution of $S$ is more skewed than that of $y$.
Exhibits 5.8 (c) and (d) show scatter diagrams of $S$ and $y$ against education.
As could be expected, the variation of salaries is considerably larger for higher
levels of education than for lower levels of education. This effect is much less
pronounced for the variable $y$. This provides statistical reasons to formulate
models in terms of the variable $y$ instead of the variable $S$. Regression models
for $y$ also have an attractive economic interpretation, as $dy/dx_j = (dS/dx_j)/S$
measures the relative increase in salary due to an increase in the explanatory
variable $x_j$. We are often more interested in such relative effects than in
absolute effects.

Exhibit 5.8 Bank Wages (Example 5.5)

Histograms of salary (a) and log salary (b), and scatter diagrams of salary against education (c)
and of log salary against education (d).

(ii)  Tests  of  linearity  and  log-linearity 

Now  we  consider  the  following  model  for  the  relation  between  (scaled)  salary 
(the  dependent  variable  is  expressed  in  terms  of  S  =  Salary/$10, 000)  and 
education  (x).  Here  we  scale  the  salary  to  make  the  two  variables  x  and  S  of 
similar  order  of  magnitude. 

$$S_i(\lambda) = \frac{S_i^{\lambda} - 1}{\lambda} = \alpha + \gamma D_{gi} + \mu D_{mi} + \beta x_i + \varepsilon_i.$$

(a) [Graph: value of the log-likelihood against LAMBDA]

(b) Panel 2: Method: Maximum Likelihood
Dependent variable: (S^lambda - 1)/lambda with S = Salary/10000
Included observations: 474
Convergence achieved after 58 iterations

Parameter       Coefficient   Std. Error   z-Statistic   Prob.
LAMBDA          -0.835898     0.111701     -7.483362     0.0000
C                0.320157     0.022535     14.20703      0.0000
GENDER           0.102712     0.014278      7.193465     0.0000
MINORITY        -0.046302     0.011135     -4.158373     0.0000
EDUCATION        0.025821     0.003656      7.062606     0.0000
VARIANCE         0.007800     0.001986      3.926990     0.0001
Log likelihood  -519.9367

(c) Panel 3: Method: Maximum Likelihood for lambda = 1
Dependent variable: S - 1 with S = Salary/10000
Included observations: 474

Variable        Coefficient   Std. Error   z-Statistic   Prob.
C               -13314.27     2763.358     -4.818149     0.0000
GENDER            9022.212    1201.227      7.510828     0.0000
MINORITY         -5116.840    1362.978     -3.754163     0.0002
EDUCATION         3257.199     208.8534    15.59562      0.0000
Log likelihood  -759.5043

(d) Panel 4: Method: Maximum Likelihood for lambda = 0
Dependent variable: log(S) with S = Salary/10000
Included observations: 474

Variable        Coefficient   Std. Error   z-Statistic   Prob.
C                9.199980     0.058687     156.7634      0.0000
GENDER           0.261131     0.025511      10.23594     0.0000
MINORITY        -0.132673     0.028946      -4.583411    0.0000
EDUCATION        0.077366     0.004436      17.44229     0.0000
Log likelihood  -568.5082

Exhibit 5.9 Bank Wages (Example 5.5)

Values of log-likelihood for a grid of values of $\lambda$ (a) and ML estimates, both unrestricted
(Panel 2) and under the restriction that $\lambda = 1$ (Panel 3) or $\lambda = 0$ (Panel 4).

So we consider the transformation only of the dependent variable and
not of the regressors. The log-likelihood of this model is given by (5.7),
replacing the last term in parentheses of this expression by
$\varepsilon_i = S_i(\lambda) - \alpha - \gamma D_{gi} - \mu D_{mi} - \beta x_i$. The ML estimates are given in Panel 2 of Exhibit 5.9
and the ML estimate of $\lambda$ is $\hat{\lambda} = -0.836$. The exhibit also shows the results
for $\lambda = 1$ in Panel 3 (with dependent variable $S - 1$) and for $\lambda = 0$ in Panel 4
(with dependent variable $\log(S)$). The LR-tests for linearity and log-linearity
are given by

$$LR(\lambda = 1) = 2(-519.94 + 759.50) = 479.14 \quad (P = 0.0000),$$

$$LR(\lambda = 0) = 2(-519.94 + 568.51) = 97.14 \quad (P = 0.0000).$$

We conclude that linearity and log-linearity are rejected.
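The two LR statistics can be recomputed from the log-likelihood values reported in Exhibit 5.9. The sketch below uses the unrounded values from the panels; the function name `lr_test` is hypothetical, and the P-value is based on the chi-squared distribution with one degree of freedom (one restriction on $\lambda$), evaluated through the complementary error function.

```python
import math

def lr_test(loglik_unrestricted, loglik_restricted):
    """LR statistic for one restriction and its chi-squared(1) P-value."""
    lr = 2.0 * (loglik_unrestricted - loglik_restricted)
    # For one degree of freedom, P(chi2 > lr) = erfc(sqrt(lr / 2)).
    p_value = math.erfc(math.sqrt(lr / 2.0))
    return lr, p_value

loglik_ml = -519.9367        # Panel 2, unrestricted lambda
loglik_linear = -759.5043    # Panel 3, lambda = 1
loglik_log = -568.5082       # Panel 4, lambda = 0

lr_linear, p_linear = lr_test(loglik_ml, loglik_linear)
lr_log, p_log = lr_test(loglik_ml, loglik_log)
print(round(lr_linear, 2), p_linear)
print(round(lr_log, 2), p_log)
```

Using the log-likelihoods rounded to two decimals, as in the displayed formulas, gives a value for the first statistic that differs slightly in the second decimal; the reported 479.14 corresponds to the unrounded values.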

(iii)  Interpretation  of  an  alternative  relation 

We now use the ML estimates of the above model in Panel 2 of Exhibit 5.9
(with $\hat{\lambda} = -0.836$) to determine the relative increase in salary caused by an
additional year of schooling, that is, $(dS/dx)/S$. It is left as an exercise (see
Exercise 5.2) to show that in this model

$$\frac{dS/dx}{S} = \frac{\beta}{1 + \lambda(\alpha + \gamma D_g + \mu D_m + \beta x + \varepsilon)}.$$

In the log-linear model that was considered in previous examples, $\lambda = 0$ and
the marginal return to schooling is constant. Now, in our model with
$\hat{\lambda} = -0.836$, this return depends on the values of the explanatory variables.
For instance, for an 'average' non-minority male employee (with
$D_g = 0$, $D_m = 0$ and $\varepsilon = 0$), the estimated increase is

$$\frac{dS/dx}{S} = \frac{\hat{\beta}}{1 + \hat{\lambda}(\hat{\alpha} + \hat{\beta} x)} = \frac{0.0258}{0.732 - 0.022x}.$$

This means that the marginal returns of schooling increase with the previously
achieved level of education. For instance, at an education level of
$x = 10$ years the predicted increase in salary is 5.0 per cent, whereas for an
education level of $x = 20$ years this becomes 8.6 per cent. Such a non-linear
effect is in line with our previous analysis in Examples 5.1 and 5.4.
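The reported returns to schooling follow from plugging the Panel 2 estimates into the formula for $(dS/dx)/S$. The sketch below does this arithmetic; the function name `relative_return` is hypothetical, and the dummies and the disturbance are set to zero as in the text.

```python
LAMBDA_HAT = -0.835898   # Panel 2 of Exhibit 5.9
ALPHA_HAT = 0.320157     # constant term C
BETA_HAT = 0.025821      # coefficient of EDUCATION

def relative_return(x):
    """(dS/dx)/S at education level x, with D_g = D_m = 0 and disturbance zero."""
    return BETA_HAT / (1.0 + LAMBDA_HAT * (ALPHA_HAT + BETA_HAT * x))

for years in (10, 20):
    print(years, round(100.0 * relative_return(years), 1))
```

The denominator $1 + \hat{\lambda}(\hat{\alpha} + \hat{\beta} x)$ decreases in $x$ because $\hat{\lambda}$ is negative, which is what makes the estimated return increase with the level of education.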


Exercises: T: 5.2e, f.

5.2.5 Summary

In order to construct a model for the explanation of the dependent
variable we have to make a number of decisions.

• How many explanatory variables should be included in the model? This
can be investigated by means of selection criteria (such as AIC, SIC), by
tests of significance (for instance, forward selection or backward
elimination), and by comparing the predictive performance of competing
models on a hold-out sample.

•  What  is  the  best  way  to  incorporate  the  variables  in  the  model?  In  many 
cases  the  model  has  a  better  economic  interpretation  if  variables  are 
taken  in  logarithms,  and,  if  the  observed  data  contain  trends,  it  may  be 
worthwhile  to  take  first  differences. 

• Can the relation between explanatory variables and explained variable
be expressed by a linear model or is the relationship non-linear? The
method of local regression and Ramsey's RESET can be used to get an
idea of possible non-linearities.

5.3 Varying parameters


5.3.1  The  use  of  dummy  variables 

Relaxing  the  assumption  of  fixed  parameters 

In the linear model $y = X\beta + \varepsilon$, the 'direct' effect of a regressor $x_j$ on the
dependent variable $y$ is given by $\partial y/\partial x_j = \beta_j$. The assumption of fixed
parameters (Assumption 5) means that these effects are the same for all
observations. If these effects differ over the sample, then this can be modelled in
different ways. In Section 5.2.2 we discussed the addition of quadratic terms
and product terms of regressors. In other cases the sample can be split in
groups so that the parameters are constant for all observations within a
group but differ between groups. For example, the sampled population
may consist of several groups that are affected in different ways by the
regressors. This kind of parameter variation can be modelled by means of
dummy variables.

An  example:  Seasonal  dummies 

For example, suppose that the data consist of quarterly observations with a
mean level that varies over the seasons. This can be represented by the time
varying parameter model

$$y_i = \alpha_i + \sum_{j=2}^{k} \beta_j x_{ji} + \varepsilon_i, \qquad (5.8)$$

where $\alpha_i$ takes on four different values, according to the season of the $i$th
observation. This means that $\alpha_i = \alpha_{i+4}$ for all $i$, as the observations $i$ and
$(i + 4)$ fall in the same season. Now define four dummy variables
$D_{hi}$, $h = 1, 2, 3, 4$, where $D_{hi} = 1$ if the $i$th observation falls in season
$h$ and $D_{hi} = 0$ if the $i$th observation falls in another season. These variables
are called 'dummies' because they are artificial variables that we define
ourselves. With the help of these dummies, the model (5.8) can be
expressed as


$$y_i = \alpha_1 D_{1i} + \alpha_2 D_{2i} + \alpha_3 D_{3i} + \alpha_4 D_{4i} + \sum_{j=2}^{k} \beta_j x_{ji} + \varepsilon_i. \qquad (5.9)$$

This is a linear regression model with constant parameters. That is, the
parameter variation in (5.8) is removed by including dummy variables as
additional regressors. In practice we often prefer models that include a
constant term. In this case we should delete one of the dummy variables in
(5.9) from the model. For instance, if we delete the variable $D_1$, then (5.9)
can be reformulated as

$$y_i = \alpha_1 + \gamma_2 D_{2i} + \gamma_3 D_{3i} + \gamma_4 D_{4i} + \sum_{j=2}^{k} \beta_j x_{ji} + \varepsilon_i, \qquad (5.10)$$

where $\gamma_s = \alpha_s - \alpha_1$ for $s = 2, 3, 4$. The first quarter is called the reference
quarter in this case and the parameters $\gamma_s$ measure the incremental effects
of the other quarters relative to the first quarter. Clearly, the parameters $\gamma_s$ in
(5.10) have a different interpretation from the parameters $\alpha_s$ in (5.9). For
instance, suppose we want to test whether the second quarter has a significant
effect on the level of $y$. A $t$-test on $\alpha_2$ in (5.9) corresponds to the null
hypothesis that $E[y_i] = \sum_{j=2}^{k} \beta_j x_{ji}$ in the second quarter. However, a $t$-test on
$\gamma_2$ in (5.10) corresponds to the null hypothesis that $E[y_i] = \alpha_1 + \sum_{j=2}^{k} \beta_j x_{ji}$ in
the second quarter, that is, that $\alpha_1 = \alpha_2$. The latter hypothesis is more
interesting. If we delete another dummy variable from (5.9) (for instance,
$D_4$ instead of $D_1$), then the dummy part in (5.10) becomes
$\alpha_4 + \delta_1 D_{1i} + \delta_2 D_{2i} + \delta_3 D_{3i}$, where $\delta_s = \alpha_s - \alpha_4$ for $s = 1, 2, 3$. The
interpretation of the $t$-test on $\delta_2$ differs from that of the $t$-test on $\gamma_2$. In general, models
with dummy variables can often be formulated in different ways, and we can
choose the one with the most appealing interpretation.
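The construction of the seasonal dummies and the link between the parameterizations (5.9) and (5.10) can be sketched as follows. The function name `seasonal_dummies` and the four seasonal levels are hypothetical values chosen for illustration.

```python
def seasonal_dummies(n, first_quarter=1):
    """Rows [D1, D2, D3, D4] for n consecutive quarterly observations."""
    rows = []
    for i in range(n):
        quarter = (first_quarter - 1 + i) % 4 + 1
        rows.append([1 if quarter == h else 0 for h in (1, 2, 3, 4)])
    return rows

D = seasonal_dummies(8)

# Hypothetical seasonal levels alpha_1, ..., alpha_4 as in (5.9).
alpha = [2.0, 2.5, 1.0, 3.0]
# Incremental effects relative to the first quarter as in (5.10).
gamma = [a - alpha[0] for a in alpha[1:]]

level_59 = [sum(a * d for a, d in zip(alpha, row)) for row in D]
level_510 = [alpha[0] + sum(g * d for g, d in zip(gamma, row[1:])) for row in D]
print(level_59)
print(level_510)
```

In every row the four dummies add up to one, which is the value of the constant term regressor. This is why one dummy has to be deleted when the model contains a constant term.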


The  use  of  dummies  for  piece-wise  linear  relations 

Dummy variables can also be used to model varying slope parameters. For
instance, suppose that the dependence of $y$ on $x_2$ is continuous and piece-wise
linear with slope $\beta_2$ for $x_2 \le a$ and with slope $\beta_2 + \gamma_2$ for $x_2 > a$. This can be
formulated as follows. Let $D$ be a dummy variable with $D_i = 0$ if $x_{2i} \le a$ and
$D_i = 1$ if $x_{2i} > a$; then

$$y_i = \alpha + \beta_2 x_{2i} + \gamma_2 (x_{2i} - a) D_i + \sum_{j=3}^{k} \beta_j x_{ji} + \varepsilon_i.$$

This model has constant parameters and it is linear in the parameters,
provided that the break point $a$ is known. The null hypothesis that the
marginal effect of $x_2$ on $y$ does not vary over the sample can be tested by a
$t$-test on the significance of $\gamma_2$.
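The piece-wise linear specification can be checked with a small sketch that builds the extra regressor $(x_{2i} - a)D_i$ and evaluates the systematic part of the model. All names and parameter values below are hypothetical, and the other regressors and the disturbance are left out.

```python
def kink_regressor(x2, a):
    """The regressor (x2 - a) * D, with D = 1 if x2 > a and D = 0 otherwise."""
    d = 1 if x2 > a else 0
    return (x2 - a) * d

def piecewise_mean(x2, a, alpha, beta2, gamma2):
    """Systematic part alpha + beta2 * x2 + gamma2 * (x2 - a) * D."""
    return alpha + beta2 * x2 + gamma2 * kink_regressor(x2, a)

# Hypothetical parameter values and break point.
a, alpha, beta2, gamma2 = 4.0, 1.0, 0.5, 1.5

slope_left = (piecewise_mean(2.0, a, alpha, beta2, gamma2)
              - piecewise_mean(0.0, a, alpha, beta2, gamma2)) / 2.0
slope_right = (piecewise_mean(6.0, a, alpha, beta2, gamma2)
               - piecewise_mean(4.0, a, alpha, beta2, gamma2)) / 2.0
print(slope_left, slope_right)
```

Because the extra regressor equals zero at the break point, the fitted relation is continuous there, and the slope changes from $\beta_2$ to $\beta_2 + \gamma_2$.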
Example 5.6: Fashion Sales

We  consider  US  retail  sales  data  of  high-priced  fashion  apparel.  The  data  are 
taken  from  G.M.  Allenby,  L.  Jen,  and  R.P.  Leone,  ‘Economic  Trends  and 
Being  Trendy:  The  Influence  of  Consumer  Confidence  on  Retail  Fashion 
Sales’,  Journal  of  Business  and  Economic  Statistics,  14/1  (1996),  103-11. 
We  may  expect  seasonal  fluctuations  in  sales  of  fashion  apparel  —  for  instance, 
because  of  sales  actions  around  the  change  of  seasons.  We  will  discuss  (i)  the 
data  and  the  model,  and  (ii)  estimation  results  and  tests  of  seasonal  effects. 


(i)  The  data  and  the  model 

We consider quarterly data from 1986 to 1992, so that $n = 28$, and we
investigate whether there exists a quarterly effect in the relation between
sales ($S_i$, real sales per thousand square feet of retail space) and two explanatory
variables, purchasing ability ($A_i$, real personal disposable income) and
consumer confidence ($C_i$, an index of consumer sentiment). We define four
quarterly dummies $D_{ji}$, $j = 1, 2, 3, 4$, where $D_{ji} = 1$ if the $i$th observation falls
in quarter $j$ and $D_{ji} = 0$ if the $i$th observation does not fall in quarter $j$. The
general levels of sales and the effect of purchasing ability and consumer
confidence on fashion sales may vary over the seasons. We suppose that the
standard Assumptions 1-7 are satisfied for the model

$$\log(S_i) = \alpha_1 + \sum_{j=2}^{4} \alpha_j D_{ji} + \sum_{j=1}^{4} \beta_j D_{ji} \log(A_i) + \sum_{j=1}^{4} \gamma_j D_{ji} \log(C_i) + \varepsilon_i.$$

The variation in the coefficients $\alpha$ reflects the possible differences in the
average level of retail fashion sales between seasons.


(ii)  Estimation  results  and  tests  on  seasonal  effects 

Exhibit 5.10 shows the results of three estimated models. The null hypothesis
that the effects of the variables $A_i$ and $C_i$ on sales do not depend on the
season corresponds to the six parameter restrictions $\beta_1 = \beta_2 = \beta_3 = \beta_4$ and
$\gamma_1 = \gamma_2 = \gamma_3 = \gamma_4$. The corresponding $F$-test of this hypothesis can be
computed from the results in Panels 1 and 2 of Exhibit 5.10, that is,

$$F = \frac{(0.1993 - 0.1437)/6}{0.1437/(28 - 12)} = 1.03 \quad (P = 0.440).$$
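The $F$-statistic can be recomputed from the sums of squared residuals reported in Exhibit 5.10. The sketch below uses the unrounded values from the panels; the function name `f_statistic` is hypothetical, and the P-values are not recomputed because that would require the $F$ distribution function.

```python
def f_statistic(ssr_restricted, ssr_unrestricted, g, n, k):
    """F-statistic for g restrictions, with n observations and k unrestricted parameters."""
    return ((ssr_restricted - ssr_unrestricted) / g) / (ssr_unrestricted / (n - k))

# Panel 2 (restricted, 6 parameters) against Panel 1 (unrestricted, 12 parameters).
f_slopes = f_statistic(0.199337, 0.143700, g=6, n=28, k=12)
print(round(f_slopes, 2))

# The same function applies to the comparison of Panels 3 and 2 of Exhibit 5.10,
# where the three seasonal dummies are deleted from the six-parameter model.
f_seasons = f_statistic(1.529237, 0.199337, g=3, n=28, k=6)
print(round(f_seasons, 2))
```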


Therefore,  the  null  hypothesis  of  constant  parameters  for  ft  and  y  is  not 
rejected.  The  corresponding  restricted  model  has  six  parameters,  and  we 
test  whether  fashion  sales  depend  on  the  season  —  that  is,  we  test  whether 
<22  =  «3  =  CC4  =  0  in  this  model.  The  results  in  Panels  2  and  3  of  Exhibit 
5.10  give 
Panel 1: Dependent Variable: LOGSALES
Method: Least Squares
Sample: 1986:1 1992:4
Included observations: 28

Variable     Coefficient   Std. Error   t-Statistic   Prob.

C            -13.20387     5.833571     -2.263429     0.0379
D2            12.75170     12.06633      1.056800     0.3063
D3            9.671240     8.959647      1.079422     0.2964
D4            8.545816     7.282290      1.173507     0.2578
LOGA*D1       2.711783     0.841984      3.220704     0.0053
LOGA*D2       1.085208     1.175397      0.923269     0.3696
LOGA*D3       0.737792     0.849956      0.868036     0.3982
LOGA*D4       1.386734     0.660245      2.100334     0.0519
LOGC*D1       0.933291     0.445010      2.097239     0.0522
LOGC*D2       0.003096     1.006677      0.003076     0.9976
LOGC*D3       0.860878     0.609848      1.411626     0.1772
LOGC*D4       0.587470     0.342052      1.717487     0.1052

R-squared     0.910259     Sum squared resid   0.143700

Panel 2: Dependent Variable: LOGSALES
Method: Least Squares
Sample: 1986:1 1992:4
Included observations: 28

Variable     Coefficient   Std. Error   t-Statistic   Prob.

C            -6.139694     2.870911     -2.138587     0.0438
D2            0.193198     0.051066      3.783329     0.0010
D3            0.313589     0.051166      6.128849     0.0000
D4            0.618763     0.052318      11.82706     0.0000
LOGA          1.488666     0.393303      3.785039     0.0010
LOGC          0.660192     0.240432      2.745860     0.0118

R-squared     0.875514     Sum squared resid   0.199337

Panel 3: Dependent Variable: LOGSALES
Method: Least Squares
Sample: 1986:1 1992:4
Included observations: 28

Variable     Coefficient   Std. Error   t-Statistic   Prob.

C             1.175230     7.073808      0.166138     0.8694
LOGA          0.774249     0.986040      0.785210     0.4397
LOGC         -0.022716     0.587800     -0.038646     0.9695

R-squared     0.044989     Sum squared resid   1.529237


Exhibit  5.10  Fashion  Sales  (Example  5.6) 


Regressions  of  sales  on  purchasing  ability  and  consumer  confidence  (all  in  logarithms),  with 
seasonal  variation  in  all  parameters  (Panel  1)  or  only  in  the  constant  term  (Panel  2)  or  in  none 
of  the  parameters  (Panel  3). 


$$F = \frac{(1.5292 - 0.1993)/3}{0.1993/(28 - 6)} = 48.93 \quad (P = 0.000).$$


This hypothesis is therefore clearly rejected. Exhibit 5.11 shows the residuals
of the model with $\alpha_2 = \alpha_3 = \alpha_4 = 0$. The residuals show a
clear seasonal pattern with peaks in the fourth quarter. This can also be
interpreted as a violation of Assumption 2 that the disturbance terms have a
fixed mean. The seasonal variation of this mean is modelled by including the
three dummy variables with parameters $\alpha_2$, $\alpha_3$, and $\alpha_4$ in
the model.

Exhibit 5.11 Fashion Sales (Example 5.6)

Residuals of the model for fashion sales where none of the parameters is
allowed to vary over the seasons (note that time is measured on the horizontal
axis and that the values of the residuals are measured on the vertical axis).
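The two F-tests of Example 5.6 follow the same pattern and can be checked from the sums of squared residuals reported in Exhibit 5.10. The sketch below is only an arithmetic check; the helper name `f_statistic` is an assumption of this illustration, and P-values are not computed.

```python
def f_statistic(ssr_restricted, ssr_unrestricted, n_restrictions, n_obs, n_params):
    """F = ((S_R - S_U) / g) / (S_U / (n - k)).

    n_params is the number of parameters k of the unrestricted model and
    n_restrictions is the number of restrictions g under the null hypothesis.
    """
    numerator = (ssr_restricted - ssr_unrestricted) / n_restrictions
    denominator = ssr_unrestricted / (n_obs - n_params)
    return numerator / denominator

# Panel 2 (restricted) against Panel 1 (unrestricted): six restrictions.
f_slopes = f_statistic(0.1993, 0.1437, n_restrictions=6, n_obs=28, n_params=12)

# Panel 3 (restricted) against Panel 2 (unrestricted): three restrictions.
f_levels = f_statistic(1.5292, 0.1993, n_restrictions=3, n_obs=28, n_params=6)

print(round(f_slopes, 2), round(f_levels, 2))
```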

Example  5.7:  Coffee  Sales 

As a second illustration of the use of dummy variables we return to the
marketing data on coffee sales of two brands of coffee that were discussed
before in Section 4.2.5 (pp. 218-21). We will discuss (i) the results for the
two brands separately, (ii) a combined model for the two brands, (iii) a test
of constant elasticity in the combined model, and (iv) the interpretation of
the results.

(i)  Results  for  the  two  brands  separately 

In  Section  4.2.5  we  analysed  the  relation  between  coffee  sales  and  the  applied 
deal  rate  and  we  tested  the  null  hypothesis  of  constant  price  elasticity  for  two 
brands  of  coffee.  Although  scatter  diagrams  of  the  data  indicate  a  decreasing 
elasticity  for  larger  deal  rates  (see  Exhibit  4.5),  we  had  difficulty  in  rejecting 
the  null  hypothesis  of  constant  elasticity  when  this  is  tested  for  the  two 
brands  separately.  A  possible  reason  is  the  small  number  of  observations, 
n  =  12,  for  both  brands. 

(ii)  A  combined  model  for  the  two  brands 

We  will  now  consider  a  model  that  combines  the  information  of  the  two 
brands.  The  model  for  the  effect  of  price  deals  (denoted  by  d)  on  coffee  sales 
(denoted  by  q)  in  Section  4.2.5  is  given  by 


$$\log(q_i) = \beta_1 + \frac{\beta_2}{\beta_3}\left(d_i^{\beta_3} - 1\right) + \varepsilon_i.$$

(Data file: XM507COF)
The price elasticity in this model is equal to $\beta_2 d^{\beta_3}$ (see
Example 4.2, p. 204), and the null hypothesis of constant price elasticity
corresponds to the parameter restriction $\beta_3 = 0$. Now we combine the data
of the two brands of coffee by means of the model

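$$\log(q_i) = D_{1i}\left(\beta_1 + \frac{\beta_2}{\beta_3}\left(d_i^{\beta_3} - 1\right)\right) + D_{2i}\left(\gamma_1 + \frac{\gamma_2}{\gamma_3}\left(d_i^{\gamma_3} - 1\right)\right) + \varepsilon_i, \quad i = 1, \ldots, 24,$$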

where $D_{1i} = 1$ for the observations of brand one and $D_{1i} = 0$ for the
observations of brand two, and where $D_{2i} = 1 - D_{1i}$. This model allows
for the possibility that all regression coefficients differ between the two
brands of coffee. Exhibit 5.12 shows the NLS estimates of this model in Panel 1
and of the restricted model with $\beta_2 = \gamma_2$ and $\beta_3 = \gamma_3$
in Panel 3. This corresponds to the assumption that the elasticities are the
same for the two brands of coffee, that is,
$\beta_2 d^{\beta_3} = \gamma_2 d^{\gamma_3}$. We do not impose the condition
$\beta_1 = \gamma_1$, as the levels of sales are clearly different for the two
brands (see Exhibit 4.5). The Wald test for the hypothesis that
$(\beta_2, \beta_3) = (\gamma_2, \gamma_3)$ has P-value 0.249 (see Panel 2). We
do not reject this hypothesis and therefore we will consider the following
combined model for the two brands of coffee:

$$\log(q_i) = D_{1i}\beta_1 + D_{2i}\gamma_1 + \frac{\beta_2}{\beta_3}\left(d_i^{\beta_3} - 1\right) + \varepsilon_i, \quad i = 1, \ldots, 24.$$
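To make the role of $\beta_3$ in the combined model more concrete, the following Python sketch evaluates the deal term $(\beta_2/\beta_3)(d^{\beta_3} - 1)$ and the implied elasticity $\beta_2 d^{\beta_3}$. The function names are hypothetical, the deal rates in the loop are illustrative values rather than the data of the example, and the parameter values are the Panel 3 estimates C(3) and C(4) of Exhibit 5.12.

```python
import math

def deal_term(d, beta2, beta3):
    """(beta2 / beta3) * (d**beta3 - 1); for beta3 -> 0 this tends to beta2 * log(d)."""
    if abs(beta3) < 1e-12:
        return beta2 * math.log(d)
    return beta2 / beta3 * (d ** beta3 - 1.0)

def deal_elasticity(d, beta2, beta3):
    """Elasticity of sales with respect to the deal rate: beta2 * d**beta3."""
    return beta2 * d ** beta3

beta2_hat, beta3_hat = 10.23724, -10.67745   # C(3) and C(4) in Panel 3

# Illustrative deal rates (assumed values, not taken from the data set).
for d in (1.0, 1.05, 1.15):
    print(d, round(deal_elasticity(d, beta2_hat, beta3_hat), 2))
```

With a negative estimate of $\beta_3$ the elasticity falls as the deal rate rises, and for $\beta_3$ close to zero the deal term is close to $\beta_2 \log(d)$, the constant-elasticity case.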

(iii)  Test  of  constant  elasticity  in  the  combined  model 

We now test the hypothesis of constant elasticity in the above combined
model for the two brands. That is, we test whether $\beta_3 = 0$, in which case
$\frac{\beta_2}{\beta_3}\left(d_i^{\beta_3} - 1\right)$ reduces to
$\beta_2 \log(d_i)$, as in the Box-Cox transformation. The results in Panels
3-5 of Exhibit 5.12 are used to compute the values of the Wald test, the
Likelihood Ratio test, and the Lagrange Multiplier test. For the Wald test (for
a single parameter restriction) we use the relation (4.50) with the t-test, and
Panel 3 gives

$$W = \frac{n}{n - k}\, t^2 = \frac{24}{20}(-2.520)^2 = 7.62 \quad (P = 0.006).$$

The  Likelihood  Ratio  test  is  obtained  from  Panels  3  and  4  and  is  equal  to 

$$LR = 2(l_1 - l_0) = 2(22.054 - 18.549) = 7.01 \quad (P = 0.008).$$

The  Lagrange  Multiplier  test  is  computed  in  a  similar  way  as  described  in 
Sections  4.2.5  (p.  221)  and  4.3.9  (p.  247),  and  Panel  5  gives 


$$LM = nR^2 = 24(0.253) = 6.08 \quad (P = 0.014).$$
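The three test statistics can be recomputed from the entries of Exhibit 5.12, and since each tests a single restriction, the P-values can be taken from the chi-square distribution with one degree of freedom. The following Python sketch does this with the standard library only; the helper name `chi2_1_pvalue` is an assumption of this illustration.

```python
import math

def chi2_1_pvalue(x):
    """Upper-tail probability of a chi-square(1) variable.

    If Z is standard normal, P(Z**2 > x) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(x / 2.0))

n, k = 24, 4                               # observations and parameters in Panel 3
wald = n / (n - k) * (-2.519770) ** 2      # t-statistic of C(4) in Panel 3
lr = 2.0 * (22.05400 - 18.54891)           # log likelihoods of Panels 3 and 4
lm = n * 0.253300                          # R-squared of Panel 5

for name, value in (("W", wald), ("LR", lr), ("LM", lm)):
    print(name, round(value, 2), round(chi2_1_pvalue(value), 3))
```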
(iv) Interpretation of the results

The above test outcomes indicate that the null hypothesis of constant deal
elasticity should be rejected. Our earlier results in Example 4.6 in Section
4.3.9 gave less clear conclusions. This illustrates the power of imposing
model restrictions: in this example, the assumption that the functional
dependence of the elasticity on the deal rate is the same for the two brands of
coffee. The combined model is estimated for twenty-four observations, so
that, in comparison with our analysis in Section 4.2.5, we gain twelve


Panel 1: Dependent Variable: LOGQ
Method: Least Squares
Sample: 1 24
Included observations: 24
Convergence achieved after 5 iterations
LOGQ=C(1)*DUMRGC1+C(2)*DUMRGC2+C(3)/C(4)*DUMRGC1
     *(DEAL^C(4)-1)+C(5)/C(6)*DUMRGC2*(DEAL^C(6)-1)

Parameter    Coefficient   Std. Error   t-Statistic   Prob.

C(1)          5.807118     0.041721      139.1879     0.0000
C(2)          4.377804     0.041721      104.9294     0.0000
C(3)          10.29832     3.424515      3.007235     0.0076
C(4)         -13.43074     6.936886     -1.936133     0.0687
C(5)          10.28864     2.896461      3.552142     0.0023
C(6)         -8.595289     5.024271     -1.710753     0.1043

R-squared            0.986396     Sum squared resid   0.187993
S.E. of regression   0.102196     Log likelihood      24.13832

Panel 2: Wald Test
Null Hypothesis:  C(3)=C(5)
                  C(4)=C(6)

F-statistic   1.502889     Probability   0.249117
Chi-square    3.005777     Probability   0.222487

Panel 3: Dependent Variable: LOGQ
Method: Least Squares
Sample: 1 24
Included observations: 24
Convergence achieved after 5 iterations
LOGQ=C(1)*DUMRGC1+C(2)*DUMRGC2+C(3)/C(4)*(DEAL^C(4)-1)

Parameter    Coefficient   Std. Error   t-Statistic   Prob.

C(1)          5.778500     0.037388      154.5565     0.0000
C(2)          4.406421     0.037388      117.8577     0.0000
C(3)          10.23724     2.274838      4.500207     0.0002
C(4)         -10.67745     4.237472     -2.519770     0.0204

R-squared            0.983815     Sum squared resid   0.223654
S.E. of regression   0.105748     Log likelihood      22.05400

Exhibit  5.12  Coffee  Sales  (Example  5.7) 

Regression  of  coffee  sales  on  deal  rate  with  all  parameters  different  for  the  two  brands 
(Panel  1),  test  on  equal  elasticities  for  the  two  brands  (Panel  2),  and  regression  model  with 
equal  elasticities  (but  different  sales  levels)  for  the  two  brands  (Panel  3). 
Panel 4: Dependent Variable: LOGQ
Method: Least Squares
Sample: 1 24
Included observations: 24

Variable     Coefficient   Std. Error   t-Statistic   Prob.

DUMRGC1       5.810190     0.039926      145.5240     0.0000
DUMRGC2       4.438110     0.039926      111.1584     0.0000
LOG(DEAL)     5.333995     0.427194      12.48611     0.0000

R-squared            0.978325     Sum squared resid   0.299523
S.E. of regression   0.119428     Log likelihood      18.54891

Panel 5: Dependent Variable: RESLOGDEAL
Method: Least Squares
Sample: 1 24
Included observations: 24

Variable       Coefficient   Std. Error   t-Statistic   Prob.

DUMRGC1        -0.031689     0.037388     -0.847590     0.4067
DUMRGC2        -0.031689     0.037388     -0.847590     0.4067
LOG(DEAL)       4.072710     1.608700      2.531678     0.0198
LOG(DEAL)^2    -29.25819     11.23281     -2.604707     0.0170

R-squared            0.253300     Sum squared resid   0.223654
S.E. of regression   0.105748

Exhibit  5.12  (Contd.) 

Regression  model  for  coffee  sales  with  constant  elasticity  (Panel  4)  and  regression  of  the 
residuals  of  this  model  on  the  gradient  of  the  unrestricted  model  where  the  elasticity  depends 
on  the  deal  rate  (Panel  5). 


observations  at  the  cost  of  one  parameter.  This  gain  of  eleven  degrees  of 
freedom  leads  to  more  clear-cut  conclusions. 


Exercises: E: 5.26.


5.3.2  Recursive  least  squares 

Recursive  estimation  to  detect  parameter  variations 

If  we  want  to  model  varying  parameters  by  means  of  dummy  variables,  we 
should  know  the  nature  of  this  variation.  In  some  situations  the  choice  of 
dummy  variables  is  straightforward  (see  for  instance  Examples  5.6  and  5.7  in 
the  foregoing  section).  However,  in  other  cases  it  may  be  quite  difficult  to 
specify  the  precise  nature  of  the  parameter  variation. 

Now  suppose  that  the  data  can  be  ordered  in  a  natural  way.  For  instance,  if 
the  data  consist  of  time  series  that  are  observed  sequentially  over  time,  then  it 
is  natural  to  order  them  with  time.  If  the  data  consist  of  a  cross  section,  then 
the  observations  can  be  ordered  according  to  the  values  of  one  of  the 
explanatory  variables.  For  such  ordered  data  sets  we  can  detect  possible 
break points by applying recursive least squares. For every value of $t$ with
$k + 1 \le t \le n$, a regression is performed in the model
$y_i = x_i'\beta + \varepsilon_i$ using only the $(t - 1)$ observations
$i = 1, \ldots, t - 1$. This gives an OLS estimator $b_{t-1}$ and a
corresponding forecast $\hat{y}_t = x_t' b_{t-1}$ with forecast error

$$f_t = y_t - x_t' b_{t-1}. \qquad (5.11)$$

The recursive least squares estimators are defined as the series of estimators
$b_t$. It is left as an exercise (see Exercise 5.3) to show that these
estimators can be calculated recursively by

$$b_t = b_{t-1} + A_t x_t f_t \qquad (5.12)$$

$$A_t = A_{t-1} - \frac{1}{v_t} A_{t-1} x_t x_t' A_{t-1} \qquad (5.13)$$

$$v_t = 1 + x_t' A_{t-1} x_t, \qquad (5.14)$$

where $A_t = (X_t' X_t)^{-1}$ with $X_t$ the $t \times k$ regressor matrix for
the observations $i = 1, \ldots, t$. The result in (5.12) shows that the
magnitude of the changes $b_t - b_{t-1}$ in the recursive estimates depends on
the forecast errors $f_t$ in (5.11). Under the standard Assumptions 1-7, the
correction factor $A_t$ is proportional to the covariance matrix of the
estimator $b_t$, so that large uncertainty leads to large changes in the
estimates.
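One possible route to the recursions (5.12)-(5.14) is sketched below. It is only an outline under the assumption that $A_{t-1}$ exists, and it uses the rank-one update formula for an inverse (the Sherman-Morrison formula), which is not part of the text above. The symbol $Y_t$ for the $t \times 1$ vector $(y_1, \ldots, y_t)'$ is notation introduced for this sketch.

```latex
\begin{aligned}
A_t^{-1} &= X_t'X_t = X_{t-1}'X_{t-1} + x_t x_t' = A_{t-1}^{-1} + x_t x_t' \\
A_t &= A_{t-1} - \frac{A_{t-1} x_t x_t' A_{t-1}}{1 + x_t' A_{t-1} x_t}
     = A_{t-1} - \frac{1}{v_t} A_{t-1} x_t x_t' A_{t-1} \\
b_t &= A_t X_t' Y_t = A_t \left( X_{t-1}' Y_{t-1} + x_t y_t \right)
     = A_t \left( A_{t-1}^{-1} b_{t-1} + x_t y_t \right) \\
    &= A_t \left( \left( A_t^{-1} - x_t x_t' \right) b_{t-1} + x_t y_t \right)
     = b_{t-1} + A_t x_t \left( y_t - x_t' b_{t-1} \right)
     = b_{t-1} + A_t x_t f_t
\end{aligned}
```

The second line gives (5.13) with $v_t$ as in (5.14), and the last line gives (5.12).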

Recursive  residuals 

Under Assumptions 1-7 the forecast errors have mean $E[f_t] = 0$. As
$y_t = x_t'\beta + \varepsilon_t$ is independent of $b_{t-1}$ (that depends
only on $\varepsilon_1, \ldots, \varepsilon_{t-1}$), it follows that
$\mathrm{var}(f_t) = \mathrm{var}(y_t) + \mathrm{var}(x_t' b_{t-1}) = \sigma^2 (1 + x_t' A_{t-1} x_t) = \sigma^2 v_t$.
It is left as an exercise (see Exercise 5.3) to show that the forecast errors
$f_t$ are also mutually independent. This means that, if the model is valid (so
that in particular the parameters are constant),

$$w_t = \frac{f_t}{\sqrt{v_t}} \sim \mathrm{NID}(0, \sigma^2), \quad t = k + 1, \ldots, n. \qquad (5.15)$$

The values of $w_t$ are called the recursive residuals. To detect possible
parameter breaks it is helpful to plot the recursive estimates $b_t$ and the
recursive residuals $w_t$ as a function of $t$. If the parameters are varying,
then this is reflected in variations in the estimates $b_t$ and in relatively
large and serially correlated recursive residuals $w_t$ after the break. Such
breaks may suggest additional explanatory variables that account for the
break, or the model can be adjusted by including non-linear terms or dummy
variables, as discussed in Sections 5.2.2 and 5.3.1.
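The recursions (5.11)-(5.15) can be sketched in plain Python as follows. This is a minimal illustration, not a production routine: the function names are hypothetical, the recursion is started from OLS on the first $k$ observations (assumed to have linearly independent regressor rows), and the small data set with a level shift after the fifth observation is invented for the example.

```python
import math

def inverse(m):
    """Inverse of a small nonsingular matrix by Gauss-Jordan elimination."""
    k = len(m)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(k)]
           for i, row in enumerate(m)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(k):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]

def recursive_least_squares(xs, ys):
    """Return the final estimate b_n and the recursive residuals w_{k+1..n}.

    xs is a list of regressor rows x_t (each of length k), ys the list of y_t.
    """
    k = len(xs[0])
    # Start from the first k observations: A_k = (X_k'X_k)^{-1}, b_k = A_k X_k'y.
    xtx = [[sum(x[i] * x[j] for x in xs[:k]) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * y for x, y in zip(xs[:k], ys[:k])) for i in range(k)]
    a = inverse(xtx)
    b = [sum(a[i][j] * xty[j] for j in range(k)) for i in range(k)]
    residuals = []
    for x, y in zip(xs[k:], ys[k:]):
        f = y - sum(xi * bi for xi, bi in zip(x, b))                      # (5.11)
        ax = [sum(a[i][j] * x[j] for j in range(k)) for i in range(k)]    # A_{t-1} x_t
        v = 1.0 + sum(xi * axi for xi, axi in zip(x, ax))                 # (5.14)
        a = [[a[i][j] - ax[i] * ax[j] / v for j in range(k)]
             for i in range(k)]                                           # (5.13)
        atx = [sum(a[i][j] * x[j] for j in range(k)) for i in range(k)]   # A_t x_t
        b = [bi + gi * f for bi, gi in zip(b, atx)]                       # (5.12)
        residuals.append(f / math.sqrt(v))                                # (5.15)
    return b, residuals

# Invented data: y = 1 + 0.5 t, with the level shifted up by 2 after t = 5.
xs = [[1.0, float(t)] for t in range(1, 9)]
ys = [1.0 + 0.5 * t + (2.0 if t > 5 else 0.0) for t in range(1, 9)]
b_final, w = recursive_least_squares(xs, ys)
print([round(v, 3) for v in w])
```

In this constructed example the recursive residuals are zero up to the break and positive at the first observation after it, which is the kind of pattern the plots are meant to reveal.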
Example 5.8: Bank Wages (continued)

We  continue  our  analysis  of  the  bank  wage  data  discussed  in  previous 
sections.  Using  the  notation  introduced  there,  the  model  is 


$$y_i = \alpha + \gamma D_{gi} + \mu D_{mi} + \beta x_i + \varepsilon_i, \qquad (5.16)$$

where $y$ is the logarithm of yearly salary and $x$ the number of completed
years of education. We order the $n = 474$ employees according to the values
of $x$, starting with the lowest education. The education ranges from 8 to 21
years. Employees with ranking number 365 or lower have at most 15 years of
education ($x \le 15$), those with ranking number 366-424 have $x = 16$, and
those with ranking number 425-474 have $x > 16$.

Exhibit 5.13 shows the recursive least squares estimates of the constant
term $\alpha$ (in (a)) and of the marginal return of schooling $\beta$ (in
(b)), together with 95 per cent interval estimates. The estimates of $\beta$
show a break after observation 365, suggesting that the returns may be larger
for higher levels of education. The plot of recursive residuals in (c) shows
mostly positive values after observation 365. This means that for higher
levels of education the wages are higher than is predicted from the estimates
based on the employees with less education. These results are in line with
our analysis of non-linearities in the previous examples (see Examples 5.1,
5.4, and 5.5). All these results indicate that the effect of education on
salary is non-linear.

Exhibit 5.13 Bank Wages (Example 5.8)

Recursive estimates of constant term (a) and of slope with respect to
education (b), together with plot of recursive residuals (c). The graphs also
show 95% interval estimates of the parameters and 95% confidence intervals for
the recursive residuals.

Exercises:  T:  5.3. 


5.3.3  Tests  for  varying  parameters 

The  CUSUM  test  for  the  regression  parameters 

Although plots of recursive estimates and recursive residuals are helpful in
analysing possible parameter variations, it is also useful to perform
statistical tests on the null hypothesis of constant parameters. Such tests
can be based on the recursive residuals $w_t$ defined in (5.15). Under the
hypothesis of constant parameters, it follows from (5.15) that the sample mean
$\bar{w} = \frac{1}{n-k} \sum_{t=k+1}^{n} w_t$ is normally distributed with
mean zero and variance $\frac{\sigma^2}{n-k}$. Let
$\hat{\sigma}^2 = \frac{1}{n-k-1} \sum_{t=k+1}^{n} (w_t - \bar{w})^2$ be the
unbiased estimator of $\sigma^2$ based on the recursive residuals; then

$$\sqrt{n - k}\, \frac{\bar{w}}{\hat{\sigma}} \sim t(n - k - 1).$$

A significant non-zero mean of the recursive residuals indicates possible
instability of the regression parameters. The CUSUM test is based on the
cumulative sums

$$W_r = \sum_{t=k+1}^{r} \frac{w_t}{s}, \quad r = k + 1, \ldots, n,$$

where $s^2$ is the OLS estimate of $\sigma^2$ in the model
$y = X\beta + \varepsilon$ over the full data sample using all $n$
observations. If the model is correctly specified, then the terms
$w_t/\sigma$ are independent with distribution N(0, 1), so that $W_r$ is
approximately distributed as N(0, $r - k$). For a significance level of
(approximately) 5 per cent, an individual value $W_r$ differs significantly
from zero if $|W_r| > 2\sqrt{r - k}$. It is also possible to test for the joint
significance of the set of values $W_r$, $r = k + 1, \ldots, n$. For a
significance level of (approximately) 5 per cent, it can be shown that this set
of values indicates misspecification of the model if there exists a point $r$
for which $|W_r| > 0.948 \left(1 + 2\,\frac{r - k}{n - k}\right) \sqrt{n - k}$.
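The CUSUM path and its joint 5 per cent bounds can be computed directly from a series of recursive residuals. The sketch below assumes that the recursive residuals and the full-sample estimate $s$ are already available; the function name `cusum` and all numerical inputs are invented for illustration.

```python
import math

def cusum(recursive_residuals, s, n, k):
    """Return (r, W_r, bound_r) for r = k+1, ..., n.

    W_r is the cumulative sum of w_t / s for t = k+1, ..., r, and
    bound_r = 0.948 * (1 + 2 * (r - k) / (n - k)) * sqrt(n - k)
    is the joint bound at (approximately) the 5 per cent level.
    """
    path, total = [], 0.0
    for offset, w in enumerate(recursive_residuals, start=1):
        r = k + offset
        total += w / s
        bound = 0.948 * (1.0 + 2.0 * (r - k) / (n - k)) * math.sqrt(n - k)
        path.append((r, total, bound))
    return path

# Invented recursive residuals for n = 12 and k = 2 (so n - k = 10 values),
# with larger positive values in the second half of the sample.
w_hyp = [0.1, -0.1, 0.2, -0.2, 0.1, 0.9, 1.1, 1.0, 1.2, 0.8]
path = cusum(w_hyp, s=0.5, n=12, k=2)
crossings = [r for r, w_r, bound in path if abs(w_r) > bound]
print(crossings)
```

Any value of $r$ in `crossings` is a point where the cumulative sum leaves the band, which the joint test takes as a sign of misspecification.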

The  CUSUMSQ  test  for  the  variance 

Large values for one or more recursive residuals are not necessarily caused by
changes in the regression parameters $\beta$. Another possibility is that the
variance $\sigma^2$ of the error terms is changing, that is, the amount of
uncertainty or randomness in the observations may vary over time. This can be
investigated by considering the sequence of squared recursive residuals
$w_t^2/\sigma^2$. If the model is correctly specified, then (5.15) shows that
these values are approximately distributed as independent $\chi^2(1)$
variables. The CUSUMSQ test is based on the cumulative sums of squares

$$S_r = \frac{\sum_{t=k+1}^{r} w_t^2}{\sum_{t=k+1}^{n} w_t^2}, \quad r = k + 1, \ldots, n.$$

For large enough sample size,
$\frac{1}{n-k} \sum_{t=k+1}^{n} w_t^2 \approx \sigma^2$, so that
$(n - k) S_r$ is approximately distributed as $\chi^2(r - k)$ with expected
value $r - k$ and variance $2(r - k)$. So $S_r$ has approximately a mean of
$(r - k)/(n - k)$ and a variance of $2(r - k)/(n - k)^2$. This provides simple
tests for the individual significance of a value of $S_r$ (for fixed $r$).

Note that the values always run from $S_k = 0$ (for $r = k$) to $S_n = 1$ (for
$r = n$), independent of the values of the recursive residuals. Tests on the
joint significance of deviations of $S_r$ from their mean values have been
derived, where the model is said to be misspecified if there exists a point $r$
for which $\left| S_r - \frac{r - k}{n - k} \right| > c$. The value of $c$
depends on the significance level and on $(n - k)$.
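The CUSUMSQ values and their approximate means can be sketched in a few lines of Python. The function name `cusumsq` and the residuals are invented for illustration, and the critical value $c$ is not computed here, since it has to be taken from tables for the chosen significance level and $n - k$.

```python
def cusumsq(recursive_residuals, k):
    """Return (r, S_r, mean_r) for r = k+1, ..., n.

    S_r is the cumulative sum of squared recursive residuals up to r divided
    by the total sum of squares; mean_r = (r - k) / (n - k) is its approximate
    mean under the null hypothesis.
    """
    squares = [w * w for w in recursive_residuals]
    total = sum(squares)
    m = len(squares)                      # m = n - k
    path, running = [], 0.0
    for offset, sq in enumerate(squares, start=1):
        running += sq
        path.append((k + offset, running / total, offset / m))
    return path

# Invented recursive residuals (n - k = 10) with larger values at the end.
w_hyp = [0.1, -0.1, 0.2, -0.2, 0.1, 0.9, 1.1, 1.0, 1.2, 0.8]
sq_path = cusumsq(w_hyp, k=2)
max_dev = max(abs(s_r - mean_r) for _, s_r, mean_r in sq_path)
print(round(max_dev, 3))
```

The largest absolute deviation `max_dev` is the quantity that would be compared with the tabulated value of $c$.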

Interpretation  as  general  misspecification  tests 

Apart from the effects of changing parameters or variances, large recursive
residuals may also be caused by exceptional values of the disturbance terms
$\varepsilon_i$ in the relation $y_i = x_i'\beta + \varepsilon_i$. Such
observations are called outliers, and this is discussed in Section 5.6. It may
also be the case that breaks occur in the explanatory variables. For instance,
if one of the $x_i$ variables shows significant growth over the sample period,
then the linear approximation $y_i = x_i'\beta + \varepsilon_i$ that may be
acceptable at the beginning of the sample, for small values of $x_i$, may cause
large errors at the end of the sample. That is, the diagnostic tests CUSUM and
CUSUMSQ that are introduced here as parameter stability tests are sensitive to
any kind of instability of the model, not only to changes in the parameters.
The Chow break test

In some situations there may be a clear break point in the sample and we
want to test whether the parameters have changed at this point. Let the $n$
observations be split in two parts, the first part consisting of $n_1$
observations and the second part of the remaining $n_2 = n - n_1$
observations. In order to test the hypothesis of constant coefficients across
the two subsets of data, the model can be formulated as

$$\begin{aligned} y_1 &= X_1 \beta_1 + \varepsilon_1 \\ y_2 &= X_2 \beta_2 + \varepsilon_2, \end{aligned} \qquad (5.17)$$

where $y_1$ and $y_2$ are the $n_1 \times 1$ and $n_2 \times 1$ vectors of the
dependent variable in the two subsets and $X_1$ and $X_2$ the $n_1 \times k$
and $n_2 \times k$ matrices of explanatory variables. This can also be written
as


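$$\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} X_1 & 0 \\ 0 & X_2 \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \end{pmatrix}. \qquad (5.18)$$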


It is assumed that the model (5.18) satisfies all the standard Assumptions
1-7, in particular, that all the $(n_1 + n_2)$ error terms are independent and
have equal variance. The null hypothesis of constant coefficients is given by

$$H_0 : \beta_1 = \beta_2. \qquad (5.19)$$

This can be tested against the alternative that $\beta_1 \ne \beta_2$ by means
of the F-test. The number of parameters under the alternative hypothesis is
$2k$, and the number of restrictions in (5.19) is $k$. Least squares in the
unrestricted model (5.18) gives an error sum of squares that is equal to the
sum of the error sum of squares of the two separate regressions in (5.17) (see
Exercise 5.4). So the F-test is given by


$$F = \frac{(S_0 - S_1 - S_2)/k}{(S_1 + S_2)/(n_1 + n_2 - 2k)}, \qquad (5.20)$$


where $S_0$ is the error sum of squares under the null hypothesis (obtained by
regression in $y = X\beta + \varepsilon$ over the full sample of
$n = n_1 + n_2$ observations) and where $S_1$ and $S_2$ are obtained by the two
subset regressions in (5.17). This is called the Chow break test, and under
the null hypothesis of constant parameters it follows the
$F(k, n_1 + n_2 - 2k)$ distribution. The regressions under the alternative
hypothesis require that $n_1 \ge k$ and $n_2 \ge k$, that is, in both subsets
the number of observations should be at least as large as the number of
parameters in the model for that subset.
The Chow forecast test

The model specification (5.17) allows for a break in the parameters, but
apart from this the model structure is assumed to be the same. The model
structure under the alternative can also be left unspecified. Then the null
hypothesis is that $y = X\beta + \varepsilon$ holds for all $n_1 + n_2$
observations and the alternative is that this model only holds for the first
$n_1$ observations and that the last $n_2$ observations are generated by an
unknown model. This can be expressed by the model

$$y_i = x_i'\beta + \sum_{j=n_1+1}^{n_1+n_2} \gamma_j D_{ji} + \varepsilon_i, \qquad (5.21)$$

where $D_j$ is a dummy variable with $D_{ji} = 1$ for $i = j$ and $D_{ji} = 0$
for $i \ne j$. So, for every observation $i > n_1$, the model allows for an
additional effect $\gamma_i$ that may differ from observation to observation.
The coefficients $\gamma_j$ represent all factors that are excluded under the
null hypothesis, for instance, neglected variables, another functional form,
or another error model. The null hypothesis of constant model structure
corresponds to

$$H_0 : \gamma_j = 0 \text{ for all } j = n_1 + 1, \ldots, n_1 + n_2. \qquad (5.22)$$

This can be tested by the F-test, which is called the Chow forecast test. Using
the above notation, the Chow forecast test is computed as

$$F = \frac{(S_0 - S_1)/n_2}{S_1/(n_1 - k)}.$$


This is exactly equal to the forecast test discussed in Section 3.4.3 (p. 173)
(see Exercise 5.4). This test can also be used as an alternative to the Chow
break test (5.20) if one of the subsets of data contains fewer than $k$
observations.

Example  5.9:  Bank  Wages  (continued) 

We  continue  our  analysis  of  the  data  on  wages  and  education  where  the  data 
are  ordered  with  increasing  values  of  education  (see  Example  5.8).  We  will 
discuss  (i)  Chow  tests  on  parameter  variations,  and  (ii)  CUSUM  and 
CUSUMSQ  tests. 

(i)  Chow  tests 

To test whether an additional year of education gives the same relative
increase in wages for lower and higher levels of education, we perform a
Chow break test and a Chow forecast test. The $n = 474$ employees are split
into two groups, one group with at most sixteen years of education
($n_1 = 424$) and the other with seventeen years of education or more
($n_2 = 50$). Exhibit 5.14 shows the results of regressions for the whole data
set (in Panel 1) and for the two subsamples (in Panels 2 and 3). The Chow
break test (5.20) is given by


$$F = \frac{(30.852 - 23.403 - 2.941)/4}{(23.403 + 2.941)/(424 + 50 - 8)} = 19.93 \quad (P = 0.000).$$


Panel 1: Dependent Variable: LOGSALARY
Method: Least Squares
Sample: 1 474
Included observations: 474

Variable     Coefficient   Std. Error   t-Statistic   Prob.

C             9.199980     0.058687      156.7634     0.0000
GENDER        0.261131     0.025511      10.23594     0.0000
MINORITY     -0.132673     0.028946     -4.583411     0.0000
EDUC          0.077366     0.004436      17.44229     0.0000

R-squared     0.586851     Sum squared resid   30.85177

Panel 2: Dependent Variable: LOGSALARY
Method: Least Squares
Sample: 1 424
Included observations: 424

Variable     Coefficient   Std. Error   t-Statistic   Prob.

C             9.463702     0.063095      149.9906     0.0000
GENDER        0.229931     0.023801      9.660543     0.0000
MINORITY     -0.111687     0.027462     -4.066947     0.0001
EDUC          0.055783     0.004875      11.44277     0.0000

R-squared     0.426202     Sum squared resid   23.40327

Panel 3: Dependent Variable: LOGSALARY
Method: Least Squares
Sample: 425 474
Included observations: 50

Variable     Coefficient   Std. Error   t-Statistic   Prob.

C             9.953242     0.743176      13.39284     0.0000
GENDER        0.830174     0.263948      3.145213     0.0029
MINORITY     -0.346533     0.126096     -2.748175     0.0085
EDUC          0.019132     0.041108      0.465418     0.6438

R-squared     0.302888     Sum squared resid   2.941173

Exhibit  5.14  Bank  Wages  (Example  5.9) 

Regressions  of  salary  on  gender,  minority,  and  education  over  full  sample  (Panel  1),  over 
subsample  of  employees  with  at  most  sixteen  years  of  education  (Panel  2),  and  over 
subsample  of  employees  with  seventeen  years  of  education  or  more  (Panel  3). 
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(a) 


| - CUSUM  . 5%  Significance"] 


Exhibit  5.15  Bank  Wages  (Example  5.9) 

Plots  of  CUSUM  and  CUSUMSQ  for  wage  data 
ees  with  index  365  or  lower  have  at  most  fifteen 
366  and  424  have  sixteen  years  of  education 
seventeen  years  of  education  or  more. 


(b) 


| - CUSUM  of  Squares  .  5%  Significance  | 


,  ordered  with  increasing  education.  Employ- 
years  of  education,  those  with  index  between 
and  those  with  index  425  or  higher  have 


The Chow forecast test (3.58) gives

$$F = \frac{(30.852 - 23.403)/50}{23.403/(424 - 4)} = 2.67 \quad (P = 0.000).$$


The  null  hypothesis  of  constant  returns  of  schooling  is  clearly  rejected. 
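Both Chow statistics of this example can be reproduced from the sums of squared residuals in Exhibit 5.14. The following Python sketch is an arithmetic check only; the function names `chow_break` and `chow_forecast` are assumptions of this illustration, and P-values are not computed.

```python
def chow_break(s0, s1, s2, n1, n2, k):
    """Chow break test (5.20): F = ((S0 - S1 - S2)/k) / ((S1 + S2)/(n1 + n2 - 2k))."""
    return ((s0 - s1 - s2) / k) / ((s1 + s2) / (n1 + n2 - 2 * k))

def chow_forecast(s0, s1, n1, n2, k):
    """Chow forecast test: F = ((S0 - S1)/n2) / (S1/(n1 - k))."""
    return ((s0 - s1) / n2) / (s1 / (n1 - k))

# Sums of squared residuals from Panels 1, 2, and 3 of Exhibit 5.14.
s0, s1, s2 = 30.85177, 23.40327, 2.941173
n1, n2, k = 424, 50, 4

f_break = chow_break(s0, s1, s2, n1, n2, k)
f_forecast = chow_forecast(s0, s1, n1, n2, k)
print(round(f_break, 2), round(f_forecast, 2))
```

The break test compares against the $F(k, n_1 + n_2 - 2k)$ distribution and the forecast test against $F(n_2, n_1 - k)$, so the two statistics are not directly comparable in size.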

(ii)  CUSUM  and  CUSUMSQ  tests 

Exhibit 5.15 shows plots of the CUSUM and CUSUMSQ tests. This shows
that, at the end of the sample, the CUSUM deviates significantly from zero.
After observation $i = 366$ the recursive residuals are mostly positive,
meaning that predicted wages are smaller than the actual wages. This is in
agreement with the recursive slope estimate in Exhibit 5.13 (b), which
becomes larger after observation 366. The CUSUMSQ plot shows that the
squared recursive residuals in the first part of the sample are relatively
small and that the sum of squares builds up faster after observation 366. This
is a further sign that the returns of schooling are not constant for different
levels of education.


Exercises: T: 5.4; S: 5.19; E: 5.24, 5.31a, b, f, 5.33b, d.


5.3.4  Summary 

An econometric model usually involves a number of parameters that are
all assumed to be constant over the observation sample. It is advisable to
apply tests on parameter constancy and to adjust the model if the parameters
seem to vary over the sample.

•  The assumption of constant parameters can be tested by applying
recursive least squares, by considering plots of recursive residuals and of
the CUSUM and CUSUMSQ statistics, and by means of the break and
forecast tests of Chow.

•  If the parameters are not constant one has to think of meaningful
adjustments of the model that do have constant parameters. This may
mean that one has to adjust the specification of the model, for
instance, by choosing an appropriate non-linear model or by incorporating
additional relevant explanatory variables. Dummy variables are a
helpful tool to remove parameter variation by incorporating additional
parameters that account for this variation.
5.4  Heteroskedasticity


5.4.1  Introduction 

General  model  for  heteroskedastic  error  terms 

For ordinary least squares, it is assumed that the error terms of the model
have constant variance and that they are mutually uncorrelated. If this is not
the case, then OLS is no longer efficient, so that we can possibly get more
accurate estimates by applying different methods. In this section we discuss
the estimation and testing of models for data that exhibit heteroskedasticity,
and in the next section we discuss serial correlation.

Under Assumptions 1-6, the standard regression model is given by

$$y = X\beta + \varepsilon, \quad E[\varepsilon] = 0, \quad E[\varepsilon\varepsilon'] = \sigma^2 I.$$


In  this  section  we  suppose  that  Assumptions  1,  2,  4,  5,  and  6  are  satisfied  but 
that  Assumption  3  of  constant  variance  is  violated.  Let  the  disturbances  be 
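heteroskedastic with $E[\varepsilon_i^2] = \sigma_i^2$, $i = 1, \ldots, n$; then

$$E[\varepsilon\varepsilon'] = \Omega = \begin{pmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_n^2 \end{pmatrix}.$$
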
So the covariance matrix is diagonal because of Assumption 4 of uncorrelated
disturbances, but the elements on the diagonal may be different for each
observation. This means that the amount of randomness in the outcome of $y_i$,
measured by $\mathrm{var}(y_i) = \sigma_i^2$, may differ for each observation.

Implications  of  heteroskedasticity  for  estimation 

In least squares we minimize $\sum_{i=1}^n (y_i - x_i'\beta)^2$, but if the variances differ
it may be better to assign relatively smaller weights to observations
with large variance and larger weights to observations with small variance.
This is because observations with small error terms provide more information
on the value of $\beta$ than observations with large error terms. We can then
use a weighted least squares criterion of the form

$$\sum_{i=1}^n w_i^2 (y_i - x_i'\beta)^2,$$

with weights $w_i^2$ that decrease for larger values of $\sigma_i^2$. The choice of optimal
weights is one of the issues discussed below. First we give two examples.

Example  5.10:  Bank  Wages  (continued) 

We  consider  again  the  bank  wage  data  of  474  bank  employees.  We  will 
discuss  (i)  three  job  categories,  (ii)  a  possible  model  for  heteroskedasticity, 
and  (iii)  a  graphical  idea  of  the  amount  of  variation  in  wages. 

(i)  Three  job  categories 

The  bank  employees  can  be  divided  according  to  three  job  categories  — 
namely,  administrative  jobs,  custodial  jobs,  and  management  jobs.  It  may 
well  be  that  the  amount  of  variation  in  wages  differs  among  these  three 
categories.  For  instance,  for  a  given  level  of  education  it  may  be  expected 
that  employees  with  custodial  jobs  earn  more  or  less  similar  wages.  However, 
two  managers  with  the  same  level  of  education  may  have  quite  different 
salaries  —  for  instance,  because  the  job  responsibilities  differ  or  because  the 
two  employees  have  different  management  experience. 

(ii)  A  possible  model  for  heteroskedasticity 

We  consider  the  regression  model 

$$y_i = \beta_1 + \beta_2 x_i + \beta_3 D_{gi} + \beta_4 D_{mi} + \beta_5 D_{2i} + \beta_6 D_{3i} + \varepsilon_i,$$

where $y_i$ is the logarithm of yearly wage, $x_i$ is the number of years of education,
$D_g$ is a gender dummy (1 for males, 0 for females), and $D_m$ is a minority
dummy (1 for minorities, 0 otherwise). Administration is taken as reference
category and $D_2$ and $D_3$ are dummy variables ($D_2 = 1$ for individuals with a
custodial job and $D_2 = 0$ otherwise, and $D_3 = 1$ for individuals with a
management position and $D_3 = 0$ otherwise). We sort the observations so
that the first $n_1 = 363$ individuals have jobs in administration, the next
$n_2 = 27$ have custodial jobs, and the last $n_3 = 84$ have jobs in
management. If we allow for different variances among the three job categories,
the covariance matrix can be specified as follows, where $I_{n_i}$ denotes the
$n_i \times n_i$ identity matrix for $i = 1, 2, 3$.

$$\Omega = \begin{pmatrix} \sigma_1^2 I_{n_1} & 0 & 0 \\ 0 & \sigma_2^2 I_{n_2} & 0 \\ 0 & 0 & \sigma_3^2 I_{n_3} \end{pmatrix}.$$
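
As a minimal sketch of this covariance structure, the following Python snippet builds the diagonal matrix $\Omega$ for observations sorted by job category. The group sizes are those of the text; the three variance values and the function name `group_covariance` are hypothetical and serve only as illustration.

```python
import numpy as np

def group_covariance(group_sizes, group_variances):
    """Diagonal covariance matrix with one variance per group.

    Assumes the observations are sorted by group, as in the text.
    """
    sizes = np.asarray(group_sizes, dtype=int)
    variances = np.asarray(group_variances, dtype=float)
    # Repeat each group's variance n_i times and put the result on the diagonal.
    return np.diag(np.repeat(variances, sizes))

# Group sizes from the example; the variances are made-up illustrative values.
omega = group_covariance([363, 27, 84], [0.04, 0.01, 0.09])
print(omega.shape)
```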

Exhibit 5.16  Bank Wages (Example 5.10)

[Figure: panels (a) and (b)]

Unconditional variation in log salary (a) and conditional variation of residuals of log salary
(after regression on education, gender, minority, and job category dummies) (b). The job
categories are administration (1), custodial jobs (2), and management (3), with respective
sizes of the subsamples $n_1 = 363$, $n_2 = 27$, and $n_3 = 84$.


(iii)  Graphical  impression  of  the  amount  of  variation 

Exhibit  5.16  shows  for  each  job  category  both  the  (unconditional)  variation 
in  y  in  (a)  and  the  conditional  variation  (that  is,  the  variation  of  the  OLS 
residuals  of  the  above  regression  model)  in  (b).  The  exhibit  indicates  that  the 
variations  are  the  smallest  for  custodial  jobs. 

Example 5.11: Interest and Bond Rates

We  now  consider  monthly  data  on  the  short-term  interest  rate  (the  three- 
month  Treasury  Bill  rate)  and  on  the  AAA  corporate  bond  yield  in  the  USA. 
As  Treasury  Bill  notes  and  AAA  bonds  can  be  seen  as  alternative  ways  of 
investment  in  low-risk  securities,  it  may  be  expected  that  the  AAA  bond  rate 
is  positively  related  to  the  interest  rate.  It  may  further  be  that  this  relation 
holds  more  tightly  for  lower  than  for  higher  levels  of  the  rates,  as  for  higher 
rates  there  may  be  more  possibilities  for  speculative  gains.  We  will  discuss  (i) 
the  data  and  the  model,  (ii)  a  graphical  impression  of  changes  in  variance, 
and  (iii)  a  possible  model  for  heteroskedasticity. 

(i)  Data  and  model 

The  AAA  bond  rate  is  defined  as  an  average  over  long-term  bonds  of  firms 
with  AAA  rating.  The  data  on  the  Treasury  Bill  rate  are  taken  from  the 
Federal  Reserve  Board  of  Governors  and  the  data  on  AAA  bonds  from 
Moody’s  Investors  Service.  The  data  run  from  January  1950  to  December 
1999. Let $x_i$ denote the monthly change in the Treasury Bill rate and let $y_i$ be
the monthly change in the AAA bond rate. These changes will be related to
each other, and we postulate the simple regression model

$$y_i = \alpha + \beta x_i + \varepsilon_i, \qquad i = 1, 2, \ldots, 600.$$

(ii)  Graphical  impression  of  changes  in  variance 

Exhibit  5.17  (a)  shows  the  residuals  that  are  obtained  from  regression  in  the 
above  model  (the  figure  has  time  on  the  horizontal  axis,  the  values  of  the 
residuals  are  measured  on  the  vertical  axis).  The  variance  over  the  first  half  of 
the  considered  time  period  is  considerably  smaller  than  that  over  the  second 
half.  This  suggests  that  the  uncertainty  of  AAA  bonds  has  increased  over 
time.  One  of  the  possible  causes  is  that  the  Treasury  Bill  rate  has  become 
more volatile. Exhibit 5.17 also shows two scatter diagrams of $y_i$ against $x_i$, one
(b) for the first 300 observations (1950-74) and the other (c) for the last 300
observations (1975-99).

(iii)  A  possible  model  for  heteroskedasticity 

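The magnitude of the random variations $\varepsilon_i$ in the AAA bond rate changes
may be related to the magnitude of the changes $x_i$ in the Treasury Bill rate.
For instance, if $E[\varepsilon_i^2] = \sigma^2 x_i^2$, then the covariance matrix becomes

$$\Omega = \sigma^2 \begin{pmatrix} x_1^2 & 0 & \cdots & 0 \\ 0 & x_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & x_n^2 \end{pmatrix},$$

where $n = 600$. Observations in months with small changes in the Treasury
Bill rate are then more informative about $\alpha$ and $\beta$ than observations in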
months  with  large  changes.  Alternative  models  for  the  variance  in  these 
data  will  be  considered  in  later  sections  (see  Examples  5.16  and  5.18). 


[Figure: panels (a), (b), and (c)]


Exhibit  5.17  Interest  and  Bond  Rates  (Example  5.11) 


Plot  of  residuals  of  regression  of  changes  in  AAA  bond  rate  on  changes  in  three-month 
Treasury  Bill  rate  (a)  and  scatter  diagrams  of  these  changes  over  the  periods  1950-74  (b) 
and  1975-99  (c). 

5.4.2  Properties of OLS and White standard errors

Properties  of  OLS  for  heteroskedastic  disturbances 

Suppose that Assumptions 1, 2, 5, and 6 are satisfied but that the covariance
matrix of the disturbances is not equal to $\sigma^2 I$. That is, assume that

$$y = X\beta + \varepsilon, \qquad E[\varepsilon] = 0, \qquad E[\varepsilon\varepsilon'] = \Omega.$$

Although ordinary least squares will no longer have all the optimality
properties discussed in Chapter 3, it is still attractive, as it is simple to
compute these estimates. In this section we consider the consequences of
applying ordinary least squares under the above assumptions. The OLS
estimator is given by $b = (X'X)^{-1}X'y$, and, substituting $y = X\beta + \varepsilon$, it
follows that

$$b = \beta + (X'X)^{-1}X'\varepsilon.$$

Under the stated assumptions this means that

$$E[b] = \beta, \qquad \mathrm{var}(b) = (X'X)^{-1}X'\Omega X(X'X)^{-1}. \qquad (5.23)$$

So the OLS estimator $b$ remains unbiased. However, the usual expression
$\sigma^2 (X'X)^{-1}$ for the variance does not apply anymore. Therefore, if one
routinely applies the usual least squares expressions for standard errors, then the
outcomes misrepresent the correct standard errors, unless $\Omega = \sigma^2 I$. So the
estimated coefficients $b$ are ‘correct’ in the sense of being unbiased, but the
OLS formulas for the standard errors are wrong.
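
To see how far the conventional formula can be off, the sketch below evaluates the covariance expression (5.23) for a diagonal $\Omega$ and compares it with $\sigma^2 (X'X)^{-1}$, using the average variance in place of $\sigma^2$. The design matrix, the variance pattern, and the function name `ols_covariance_sandwich` are assumptions made for this illustration only.

```python
import numpy as np

def ols_covariance_sandwich(X, omega_diag):
    """Covariance of OLS for diagonal Omega: (X'X)^{-1} X' Omega X (X'X)^{-1}."""
    XtX_inv = np.linalg.inv(X.T @ X)
    middle = X.T @ (omega_diag[:, None] * X)  # X' Omega X without forming Omega
    return XtX_inv @ middle @ XtX_inv

# Illustrative design: a constant and one regressor centred at zero.
x = np.linspace(-2.0, 2.0, 50)
X = np.column_stack([np.ones_like(x), x])

# Hypothetical variances that grow with x^2 (not taken from the text's data).
omega_diag = 0.1 + x**2

true_cov = ols_covariance_sandwich(X, omega_diag)
# Conventional formula with the average variance in the role of sigma^2.
naive_cov = omega_diag.mean() * np.linalg.inv(X.T @ X)

print("correct s.e. of slope:     ", np.sqrt(true_cov[1, 1]))
print("conventional s.e. of slope:", np.sqrt(naive_cov[1, 1]))
```

In this design the conventional formula understates the standard error of the slope, because the observations with the largest variances are also the ones with the most influence on the slope.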

White  standard  errors 

In  order  to  perform  significance  tests  we  should  estimate  the  covariance 
matrix  in  (5.23).  If  the  disturbances  are  uncorrelated  but  heteroskedastic, 
so that $\Omega$ is a diagonal matrix with elements $\sigma_1^2, \ldots, \sigma_n^2$ on the diagonal, then
(5.23) can be written as

$$\mathrm{var}(b) = (X'X)^{-1}\left(\sum_{i=1}^n \sigma_i^2 x_i x_i'\right)(X'X)^{-1}. \qquad (5.24)$$

Here $x_i$ is the $k \times 1$ vector of explanatory variables for the $i$th observation.
In most situations the values of the variances $\sigma_i^2$ are unknown. A simple
estimator of $\sigma_i^2$ is given by $e_i^2$, the square of the OLS residual $e_i = y_i - x_i'b$.
This gives

$$\widehat{\mathrm{var}}(b) = (X'X)^{-1}\left(\sum_{i=1}^n e_i^2 x_i x_i'\right)(X'X)^{-1}. \qquad (5.25)$$

This is called the White estimate of the covariance matrix of $b$, and the square
roots of the diagonal elements are called the White standard errors. Sometimes
a correction is applied. In case of homoskedastic error terms, it was
derived in Section 3.1.5 (pp. 127-8) that the residual vector $e$ has covariance
matrix $\sigma^2 M$, where $M = I - X(X'X)^{-1}X' = I - H$. So the residual $e_i$ has
variance $\sigma^2 (1 - h_i)$ in this case, where $h_i$ is the $i$th diagonal element of $H$.
For this reason one sometimes replaces $e_i^2$ in (5.25) by $e_i^2/(1 - h_i)$.
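
The White estimate (5.25) and the leverage correction can be sketched in a few lines of Python. The data below are simulated under an assumed variance pattern, and the function name `white_covariance` and its `leverage_correction` flag are hypothetical names chosen for this illustration.

```python
import numpy as np

def white_covariance(X, y, leverage_correction=False):
    """OLS coefficients and the White covariance estimate (5.25)."""
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    e2 = (y - X @ b) ** 2                      # squared OLS residuals e_i^2
    if leverage_correction:
        # h_i = x_i'(X'X)^{-1}x_i, the i-th diagonal element of H.
        h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
        e2 = e2 / (1.0 - h)
    middle = X.T @ (e2[:, None] * X)           # sum_i e_i^2 x_i x_i'
    return b, XtX_inv @ middle @ XtX_inv

# Simulated data (assumption): error standard deviation proportional to |x|.
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 0.5 * x + np.abs(x) * rng.normal(size=n)

b, V_white = white_covariance(X, y)
_, V_corrected = white_covariance(X, y, leverage_correction=True)
se_white = np.sqrt(np.diag(V_white))
se_corrected = np.sqrt(np.diag(V_corrected))
print(b, se_white, se_corrected)
```

Because $0 < h_i < 1$, the corrected standard errors are never smaller than the uncorrected ones; the difference shrinks as the sample grows.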


Proof  of  consistency  of  White  standard  errors 

Note that, even in the homoskedastic case and with the above correction, the
estimator $e_i^2/(1 - h_i)$ of the variance $\sigma_i^2$ is unbiased but not consistent. This is
because only a single observation (the $i$th) has information about the value of $\sigma_i^2$,
so that by increasing the sample size we gain no additional information on $\sigma_i^2$.
However, we will now show that the estimator (5.25) of the covariance matrix
(5.24) of $b$ is consistent, provided that

$$E[\varepsilon_i x_i] = E[(y_i - x_i'\beta)x_i] = 0.$$

That is, the orthogonality conditions should be satisfied. This is also required for
the consistency of the OLS estimator $b$. To prove that (5.25) is a consistent
estimator of the covariance matrix (5.24), we use the results in Section
4.4.3 (p. 258) on GMM estimators. Note that the GMM estimator for
the above moment conditions is equal to the OLS estimator $b$ (see Section 4.4.2
(p. 252)). The above moment conditions can be formulated as $E[g_i] = 0$ with
$g_i = (y_i - x_i'\beta)x_i$. According to the results in (4.67) and (4.68), a consistent
estimator of the covariance matrix of the GMM estimator is given by

$$\widehat{\mathrm{var}}(b) = (H'J^{-1}H)^{-1},$$

where $J = \sum_{i=1}^n g_i g_i'$ and $H = \sum_{i=1}^n \partial g_i/\partial\beta'$, with $J$ and $H$ evaluated at $b$. This
means that $J = \sum e_i^2 x_i x_i'$ and $H = -\sum x_i x_i' = -X'X$. This shows that (5.25) is the
GMM estimator of the covariance matrix, which is consistent (see Section 4.4.3).


Example  5.12:  Bank  Wages;  Interest  and  Bond  Rates  (continued) 

As  an  illustration  we  consider  the  two  examples  of  Section  5.4.1,  the  first  on 
wages  (see  Example  5.10)  and  the  second  on  interest  rates  (see  Example  5.11). 
Exhibit 5.18 shows the results of least squares with conventional OLS formulas
for the standard errors (in Panels 1 and 3) and with White
heteroskedasticity-consistent standard errors (in Panels 2 and 4). For most coefficients,
these two standard errors are quite close to each other. Note, however, that for
the interest rate data the (consistent) White standard error of the slope
coefficient is 0.023, whereas according to the conventional OLS formula this
standard error is computed as 0.015.

Panel 1: Dependent Variable: LOGSALARY
Method: Least Squares
Sample: 1 474
Included observations: 474

Variable     Coefficient   Std. Error   t-Statistic   Prob.
C               9.574694     0.054218      176.5965   0.0000
EDUC            0.044192     0.004285      10.31317   0.0000
GENDER          0.178340     0.020962      8.507685   0.0000
MINORITY       -0.074858     0.022459     -3.333133   0.0009
DUMJCAT2        0.170360     0.043494      3.916891   0.0001
DUMJCAT3        0.539075     0.030213      17.84248   0.0000
R-squared       0.760775

Panel 2: Dependent Variable: LOGSALARY
Method: Least Squares
Sample: 1 474
Included observations: 474
White Heteroskedasticity-Consistent Standard Errors & Covariance

Variable     Coefficient   Std. Error   t-Statistic   Prob.
C               9.574694     0.054477      175.7556   0.0000
EDUC            0.044192     0.004425      9.987918   0.0000
GENDER          0.178340     0.019985      8.923848   0.0000
MINORITY       -0.074858     0.020699     -3.616538   0.0003
DUMJCAT2        0.170360     0.033025      5.158477   0.0000
DUMJCAT3        0.539075     0.035887      15.02147   0.0000
R-squared       0.760775

Panel 3: Dependent Variable: DAAA
Method: Least Squares
Sample: 1950:01 1999:12
Included observations: 600

Variable     Coefficient   Std. Error   t-Statistic   Prob.
C               0.006393     0.006982      0.915697   0.3602
DUS3MT          0.274585     0.014641      18.75442   0.0000
R-squared       0.370346

Panel 4: Dependent Variable: DAAA
Method: Least Squares
Sample: 1950:01 1999:12
Included observations: 600
White Heteroskedasticity-Consistent Standard Errors & Covariance

Variable     Coefficient   Std. Error   t-Statistic   Prob.
C               0.006393     0.006992      0.914321   0.3609
DUS3MT          0.274585     0.022874      12.00409   0.0000
R-squared       0.370346

Exhibit 5.18  Bank Wages; Interest and Bond Rates (Example 5.12)

Regressions for wage data (Panels 1 and 2) and for AAA bond rate data (Panels 3 and 4),
with conventional standard errors (Panels 1 and 3) and with White standard errors (Panels 2
and 4).

Exercises: S: 5.20a.


5.4.3  Weighted  least  squares 

Models  for  the  variance 

The  use  of  OLS  with  White  standard  errors  has  the  advantage  that  no  model 
for  the  variances  is  needed.  However,  OLS  is  no  longer  efficient,  and  more 
efficient  estimators  can  be  obtained  if  one  has  reliable  information  on  the 
variances $\sigma_i^2$. If the model explaining the heteroskedasticity is sufficiently
accurate, then this will increase the efficiency of the estimators. Stated in
general terms, a model for heteroskedasticity is of the form

$$\sigma_i^2 = h(z_i'\gamma), \qquad (5.26)$$

where $h$ is a known function, $z = (1, z_2, \ldots, z_p)'$ is a vector consisting
of $p$ observed variables that influence the variances, and $\gamma$ is a vector of $p$
unknown parameters. Two specifications that are often applied are the model
with additive heteroskedasticity where $h(z'\gamma) = z'\gamma$ and the model with
multiplicative heteroskedasticity where $h(z'\gamma) = e^{z'\gamma}$. The last model has the
advantage that it always gives positive variances, whereas in the additive model
we have to impose restrictions on the parameters $\gamma$.
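
The difference between the two specifications can be illustrated with a small sketch. The values of $z_i$ and $\gamma$ below are hypothetical, as are the function names; the point is only that the additive form can produce a negative "variance" while the multiplicative form cannot.

```python
import numpy as np

def additive_variance(Z, gamma):
    """sigma_i^2 = z_i'gamma; can be negative for some parameter values."""
    return Z @ gamma

def multiplicative_variance(Z, gamma):
    """sigma_i^2 = exp(z_i'gamma); positive for every gamma."""
    return np.exp(Z @ gamma)

# Rows are z_i' = (1, z_i2); the numbers are made up for illustration.
Z = np.array([[1.0, -3.0],
              [1.0, 0.5],
              [1.0, 2.0]])
gamma = np.array([1.0, 1.0])

var_additive = additive_variance(Z, gamma)        # first entry is 1 - 3 = -2
var_multiplicative = multiplicative_variance(Z, gamma)
print(var_additive)
print(var_multiplicative)
```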

Weighted  least  squares 

A  particularly  simple  model  is  obtained  if  the  variance  depends  only  on  a 
single  variable  v  so  that 


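$$\sigma_i^2 = \sigma^2 v_i,$$

where $v_i > 0$ is known and where $\sigma^2$ is an unknown scalar parameter. An
example is the regression model for bond rates in Example 5.11, where we
proposed the model $\sigma_i^2 = \sigma^2 x_i^2$. In such cases we can transform the model

$$y_i = x_i'\beta + \varepsilon_i, \qquad E[\varepsilon_i^2] = \sigma^2 v_i, \qquad i = 1, \ldots, n,$$

by dividing the $i$th equation by $\sqrt{v_i}$. Let $y_i^* = y_i/\sqrt{v_i}$, $x_i^* = x_i/\sqrt{v_i}$, and
$\varepsilon_i^* = \varepsilon_i/\sqrt{v_i}$; then we obtain the transformed model

$$y_i^* = x_i^{*\prime}\beta + \varepsilon_i^*, \qquad E[\varepsilon_i^{*2}] = \sigma^2, \qquad i = 1, \ldots, n.$$
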
As the numbers $v_i$ are known, we can calculate the transformed data $y_i^*$ and
$x_i^*$. Further, if the original model satisfies Assumptions 1, 2, and 4-6, then
the same holds true for the transformed model. The transformed model also has
homoskedastic error terms and hence it satisfies Assumptions 1-6.
Therefore, the best linear (in $y_*$) unbiased estimator of $\beta$ is obtained by
applying least squares in the transformed model. To derive an explicit formula
for this estimator, let $X_*$ be the $n \times k$ matrix with rows $x_i^{*\prime}$ and let $y_*$ be
the $n \times 1$ vector with elements $y_i^*$. Then the estimator is given by

$$b_* = (X_*'X_*)^{-1}X_*'y_* = \left(\sum_{i=1}^n x_i^* x_i^{*\prime}\right)^{-1}\sum_{i=1}^n x_i^* y_i^* = \left(\sum_{i=1}^n \frac{1}{v_i} x_i x_i'\right)^{-1}\left(\sum_{i=1}^n \frac{1}{v_i} x_i y_i\right). \qquad (5.27)$$



This estimator is obtained by minimizing the criterion

$$S(\beta) = \sum_{i=1}^n (y_i^* - x_i^{*\prime}\beta)^2 = \sum_{i=1}^n \frac{(y_i - x_i'\beta)^2}{v_i}. \qquad (5.28)$$

As observations with smaller variance have a relatively larger weight in
determining the estimate $b_*$, this is called weighted least squares (WLS).
The intuition is that there is less uncertainty around observations with
smaller variances, so that these observations are more important for estimation.
We recall that in Section 5.2.3 we applied weighted least squares in local
regression, where the observations get larger weight the nearer they are to a
given reference value.


Illustration:  Heteroskedasticity  for  grouped  data 

In research in business and economics, the original data of individual
agents or individual firms are often averaged over groups for privacy reasons.
The groups should be chosen so that the individuals within a group are
more or less homogeneous with respect to the variables in the model.
Let the individual data satisfy the model $y = X\beta + \varepsilon$ with $E[\varepsilon] = 0$ and
$E[\varepsilon\varepsilon'] = \sigma^2 I$ (that is, with homoskedastic error terms). Let $n_j$ be the number
of individuals in group $j$; then, in terms of the reported group means, the
model becomes

$$\bar{y}_j = \bar{x}_j'\beta + \bar{\varepsilon}_j,$$

where $\bar{y}_j$ and $\bar{\varepsilon}_j$ are the means of $y$ and $\varepsilon$ and $\bar{x}_j'$ is the row vector of the means
of the explanatory variables in group $j$. The error terms satisfy
$E[\bar{\varepsilon}_j] = 0$, $E[\bar{\varepsilon}_j^2] = \sigma^2/n_j$, and $E[\bar{\varepsilon}_j\bar{\varepsilon}_h] = 0$ for $j \neq h$, so that grouping leads to
heteroskedastic disturbances with covariance matrix

$$\Omega = \sigma^2 \begin{pmatrix} n_1^{-1} & 0 & \cdots & 0 \\ 0 & n_2^{-1} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & n_G^{-1} \end{pmatrix},$$

where $G$ denotes the number of groups. The WLS estimator is given by (5.27)
with $v_j = 1/n_j$, so that

$$b_{WLS} = \left(\sum_{j=1}^G n_j \bar{x}_j \bar{x}_j'\right)^{-1}\left(\sum_{j=1}^G n_j \bar{x}_j \bar{y}_j\right).$$

The weighting factors show that larger groups get larger weights.
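
The role of the group sizes as weights can be checked numerically. The sketch below makes a simplifying assumption that is not required in the text: the explanatory variable takes the same value for every individual within a group. In that special case WLS on the group means, with weights $n_j$, reproduces OLS on the individual data exactly. All numbers and names (`grouped_wls`, `sizes`, `x_levels`) are hypothetical.

```python
import numpy as np

def grouped_wls(xbar, ybar, group_sizes):
    """WLS on group means with v_j = 1/n_j, that is, with weights n_j."""
    w = np.asarray(group_sizes, dtype=float)
    A = xbar.T @ (w[:, None] * xbar)      # sum_j n_j xbar_j xbar_j'
    c = xbar.T @ (w * ybar)               # sum_j n_j xbar_j ybar_j
    return np.linalg.solve(A, c)

# Simulated individual data; x is constant within each group (assumption).
rng = np.random.default_rng(0)
sizes = np.array([1, 5, 20, 101])
x_levels = np.array([8.0, 12.0, 15.0, 19.0])
x = np.repeat(x_levels, sizes)
X = np.column_stack([np.ones(x.size), x])
y = 9.5 + 0.05 * x + rng.normal(scale=0.3, size=x.size)

# OLS on the individual observations.
b_individual, *_ = np.linalg.lstsq(X, y, rcond=None)

# Group means of y (observations are sorted by group).
starts = np.concatenate(([0], np.cumsum(sizes)[:-1]))
ybar = np.add.reduceat(y, starts) / sizes
xbar = np.column_stack([np.ones(sizes.size), x_levels])

b_grouped = grouped_wls(xbar, ybar, sizes)
print(b_individual, b_grouped)
```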


Statistical  properties  of  the  WLS  estimator 

The  properties  of  the  weighted  least  squares  estimator  are  easily  obtained 
from the transformed model. The covariance matrix of $b_*$ is given by

$$\mathrm{var}(b_*) = \sigma^2 (X_*'X_*)^{-1} = \sigma^2 \left(\sum_{i=1}^n \frac{1}{v_i} x_i x_i'\right)^{-1}. \qquad (5.29)$$

The weighted least squares estimator is efficient, and hence its covariance
matrix is smaller than that of the OLS estimator in (5.24) (see also Exercise
5.5). In terms of the residuals of the transformed model

$$e_* = y_* - X_* b_*,$$

an unbiased estimator of the variance $\sigma^2$ is given by

$$s_*^2 = \frac{e_*'e_*}{n-k} = \frac{1}{n-k}\sum_{i=1}^n \frac{(y_i - x_i'b_*)^2}{v_i}.$$

If we add Assumption 7 that the disturbance terms are normally distributed,
then the results of Chapter 3 on testing linear hypotheses can be applied
directly to the transformed model. For instance, the $F$-test of Chapter 3 now
becomes

$$F = \frac{(e_{*R}'e_{*R} - e_*'e_*)/g}{e_*'e_*/(n-k)} = \frac{\left(\sum e_{Ri}^2/v_i - \sum e_i^2/v_i\right)/g}{\sum (e_i^2/v_i)/(n-k)}, \qquad (5.30)$$

where $e = y - Xb_*$, $e_{*R} = y_* - X_* b_{*R}$, and $e_R = y - Xb_{*R}$, with $b_{*R}$ the
restricted ordinary least squares estimator in the transformed model.

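Asymptotic properties of WLS

The asymptotic results in Chapter 4 can be applied directly to the transformed
model. For instance, if we drop Assumption 1 of fixed regressors and Assumption
7 of normally distributed error terms, then

$$\sqrt{n}\,(b_* - \beta) \xrightarrow{d} N(0, \sigma^2 Q_*^{-1}), \qquad (5.31)$$

that is, WLS is consistent and has an asymptotic normal distribution, under the
conditions that

$$\mathrm{plim}\,\frac{1}{n} X_*'X_* = \mathrm{plim}\,\frac{1}{n}\sum_{i=1}^n \frac{1}{v_i} x_i x_i' = Q_*, \qquad (5.32)$$

$$\mathrm{plim}\,\frac{1}{n} X_*'\varepsilon_* = \mathrm{plim}\,\frac{1}{n}\sum_{i=1}^n \frac{1}{v_i} x_i \varepsilon_i = 0. \qquad (5.33)$$

Under these assumptions, expressions like (5.29) and (5.30) remain valid
asymptotically.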


Summary  of  estimation  by  WLS 

Estimation  by  weighted  least  squares  can  be  performed  by  means  of  the 
following  steps. 


Weighted  least  squares 

•  Step 1: Formulate the model. Formulate the regression model
$y_i = x_i'\beta + \varepsilon_i$ and the model for the variances $E[\varepsilon_i^2] = \sigma^2 v_i$, where $(y_i, x_i)$
are observed and $v_i$ are known, $i = 1, \ldots, n$, and where $\beta$ and $\sigma^2$ are
unknown fixed parameters.

•  Step 2: Transform the data. Transform the observed data $y_i$ and $x_i$ by
dividing by $\sqrt{v_i}$, to get $y_i^* = y_i/\sqrt{v_i}$ and $x_i^* = x_i/\sqrt{v_i}$.

•  Step 3: Estimate and test with transformed data. Apply the standard
procedures for estimation and testing of Chapters 3 and 4 on the
transformed data $y_*$, $X_*$.

•  Step 4: Transform results to original data. The results can be rewritten in
terms of the original data by substituting $y_i = \sqrt{v_i}\,y_i^*$ and $x_i = \sqrt{v_i}\,x_i^*$.
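
The four steps can be sketched in Python as follows. The simulated data, the choice $v_i = x_i^2$, and the function name `weighted_least_squares` are assumptions made for the illustration; the computations follow (5.27) and (5.29).

```python
import numpy as np

def weighted_least_squares(X, y, v):
    """WLS for var(eps_i) = sigma^2 * v_i with known v_i > 0."""
    n, k = X.shape
    root_v = np.sqrt(v)
    # Step 2: transform the data by dividing by sqrt(v_i).
    X_star = X / root_v[:, None]
    y_star = y / root_v
    # Step 3: OLS on the transformed data, with the usual covariance formula.
    b_star, *_ = np.linalg.lstsq(X_star, y_star, rcond=None)
    e_star = y_star - X_star @ b_star
    s2_star = e_star @ e_star / (n - k)
    cov_b = s2_star * np.linalg.inv(X_star.T @ X_star)
    # Step 4: residuals expressed in terms of the original data.
    e = y - X @ b_star
    return b_star, cov_b, e

# Step 1 (assumed model for this sketch): variance proportional to v_i = x_i^2.
rng = np.random.default_rng(1)
n = 300
x = rng.uniform(0.5, 3.0, size=n)
X = np.column_stack([np.ones(n), x])
v = x**2
y = 2.0 + 0.7 * x + np.sqrt(v) * rng.normal(scale=0.4, size=n)

b_star, cov_b, e = weighted_least_squares(X, y, v)
print(b_star, np.sqrt(np.diag(cov_b)))
```

Working through the transformed data keeps every standard OLS routine usable unchanged; the explicit weighted sums in (5.27) give the same coefficients.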


We  illustrate  this  with  two  examples. 

Example 5.13: Bank Wages (continued)

In  this  example  we  continue  our  previous  analysis  of  the  bank  wage  data.  We 
consider  the  possible  heteroskedasticity  that  results  by  grouping  the  data.  We 
will  discuss  (i)  the  grouped  data,  and  (ii)  the  results  of  OLS  and  WLS  for  the 
grouped  data. 

(i)  Grouped  bank  wage  data 

Suppose  that  for  privacy  reasons  the  individual  bank  wage  data  are  grouped 
according  to  the  variables  gender,  minority,  job  category,  and  four  education 
groups  (10  years  or  less,  between  11  and  13  years,  between  14  and  16  years, 
and  17  years  or  more).  In  principle  this  gives  2x2x3x4  =  48  groups. 
However,  twenty-two  combinations  do  not  occur  in  the  sample,  so  that 
G  =  26  groups  remain.  Exhibit  5.19  shows  a  histogram  of  the  resulting 
group  sizes.  Some  groups  consist  of  a  single  individual,  and  the  largest  group 
contains  101  individuals.  It  is  intuitively  clear  that  the  averaged  data  in  this 
large  group  should  be  given  more  weight  than  the  data  in  the  small  groups. 

(ii)  Results  of  OLS  and  WLS  for  grouped  data 

Exhibit  5.20  shows  the  result  of  applying  OLS  to  the  grouped  data,  both  with 
OLS  standard  errors  (in  Panel  1)  and  with  White  standard  errors  (in  Panel  2), 
and  efficient  WLS  estimates  are  reported  in  Panel  3.  The  WLS  estimates  are 
clearly  different  from  the  OLS  estimates  and  the  standard  errors  of  WLS  are 
considerably smaller than those of OLS. For WLS, the $R^2$ and the standard
error of regression are reported both for weighted data (based on the
residuals $e_* = y_* - X_* b_*$ of step 3 of WLS) and for unweighted data (based
on the residuals $e = y - Xb_*$ of step 4 of WLS).


Exhibit  5.19  Bank  Wages  (Example  5.13) 

Grouped  data  of  474  employees,  with  groups  defined  by  gender,  minority,  job  category,  and 
four education groups. The histogram shows the sizes of the resulting twenty-six groups of
employees  (the  group  size  is  measured  on  the  horizontal  axis,  and  the  vertical  axis  measures  the 
frequency  of  occurrence  of  the  group  sizes  in  the  indicated  intervals  on  the  horizontal  axis). 


XM513BWA 




Panel 1: Dependent Variable: MEANLOGSAL
Method: Least Squares
Sample(adjusted): 1 26
Included observations: 26 after adjusting endpoints

Variable            Coefficient   Std. Error   t-Statistic   Prob.
C                      9.673440     0.141875      68.18272   0.0000
MEANEDUC               0.033592     0.010022      3.351783   0.0032
GENDER                 0.249522     0.074784      3.336567   0.0033
MINORITY              -0.024444     0.062942     -0.388348   0.7019
DUMJCAT2               0.019526     0.090982      0.214610   0.8322
DUMJCAT3               0.675614     0.084661      7.980253   0.0000
R-squared              0.886690
S.E. of regression     0.157479

Panel 2: Dependent Variable: MEANLOGSAL
Method: Least Squares
Sample(adjusted): 1 26
Included observations: 26 after adjusting endpoints
White Heteroskedasticity-Consistent Standard Errors & Covariance

Variable            Coefficient   Std. Error   t-Statistic   Prob.
C                      9.673440     0.125541      77.05376   0.0000
MEANEDUC               0.033592     0.009617      3.492757   0.0023
GENDER                 0.249522     0.053352      4.676939   0.0001
MINORITY              -0.024444     0.060389     -0.404766   0.6899
DUMJCAT2               0.019526     0.102341      0.190792   0.8506
DUMJCAT3               0.675614     0.104891      6.441090   0.0000
R-squared              0.886690
S.E. of regression     0.157479

Panel 3: Dependent Variable: MEANLOGSAL
Method: Least Squares
Sample(adjusted): 1 26
Included observations: 26 after adjusting endpoints
Weighting series: sq.root group size ($v_i = 1/n_i$ with $n_i$ the group size)

Variable            Coefficient   Std. Error   t-Statistic   Prob.
C                      9.586344     0.077396      123.8604   0.0000
MEANEDUC               0.043238     0.006123      7.061221   0.0000
GENDER                 0.179823     0.029525      6.090510   0.0000
MINORITY              -0.074960     0.031581     -2.373596   0.0277
DUMJCAT2               0.166985     0.061281      2.724918   0.0130
DUMJCAT3               0.542568     0.042672      12.71483   0.0000

Weighted Statistics
R-squared              0.999903
S.E. of regression     0.077288

Unweighted Statistics
R-squared              0.834443
S.E. of regression     0.190354

Exhibit 5.20  Bank Wages (Example 5.13)

Regressions for grouped wage data, OLS (Panel 1), OLS with White standard errors (Panel 2),
and WLS with group sizes as weights (Panel 3). In Panel 3, the weighted statistics refer to
the transformed data (with weighted observations) and the unweighted statistics refer to the
observed (unweighted) data.

Example 5.14: Interest and Bond Rates (continued)

We  continue  our  analysis  of  the  interest  and  bond  rate  data  introduced  in 
Example  5.11  in  Section  5.4.1.  We  will  discuss  (i)  the  application  of  weighted 
least  squares  in  this  model,  (ii)  the  outcomes  of  OLS  and  WLS,  and  (iii) 
comments  on  the  outcomes. 

(i)  Application  of  weighted  least  squares 

In Example 5.11 in Section 5.4.1 we considered the regression model
$y_i = \alpha + \beta x_i + \varepsilon_i$ for the relation between changes in the AAA bond rate $y_i$
and the three-month Treasury Bill rate $x_i$. The plots in Exhibit 5.17 suggest
$E[\varepsilon_i^2] = \sigma^2 x_i^2$ as a possible model for the variances. The WLS estimator (5.27)
is obtained by ordinary least squares in the transformed model

$$\frac{y_i}{x_i} = \alpha \cdot \frac{1}{x_i} + \beta + \varepsilon_i^*,$$

where the error terms $\varepsilon_i^* = \varepsilon_i/x_i$ are homoskedastic with $E[\varepsilon_i^{*2}] = \sigma^2$.

(a)
Panel 1: Dependent Variable: DAAA
Method: Least Squares
Sample: 1950:01 1999:12
Included observations: 600

Variable            Coefficient   Std. Error   t-Statistic   Prob.
C                      0.006393     0.006982      0.915697   0.3602
DUS3MT                 0.274585     0.014641      18.75442   0.0000
R-squared              0.370346
S.E. of regression     0.171002

(b)
Panel 2: Dependent Variable: DAAA
Method: Least Squares
Sample: 1950:01 1999:12
Included observations: 583
Excluded observations: 17
Weighting series: 1/DUS3MT ($v_i = (\mathrm{DUS3MT}_i)^2$)

Variable            Coefficient   Std. Error   t-Statistic   Prob.
C                     -0.002380     0.005143     -0.462794   0.6437
DUS3MT                 0.262260     0.144280      1.817717   0.0696

Weighted Statistics
R-squared              0.000369
S.E. of regression     7.381207

Unweighted Statistics
R-squared              0.370293
S.E. of regression     0.172944

(c)
Series: DAAA
Sample 1950:01 1999:12 with DUS3MT = 0
Observations 17

Mean          0.006471
Median        0.010000
Maximum       0.160000
Minimum      -0.320000
Std. Dev.     0.104758
Skewness     -1.676769
Kurtosis      6.841560

Exhibit 5.21  Interest and Bond Rates (Example 5.14)

Regressions for AAA bond rate data, OLS (Panel 1) and WLS (with variances proportional to
the square of DUS3MT, Panel 2). (c) shows the histogram of the values of DAAA in the
seventeen months where DUS3MT = 0 (these observations are excluded in WLS in Panel 2).

XM511IBR


(ii)  Outcomes  of  OLS  and  WLS 

Exhibit  5.21  shows  the  results  of  OLS  in  the  original  model  (in  Panel  1)  and 
of  WLS  (in  Panel  2).  Note  that,  according  to  the  WLS  outcomes  and  at  the 
5 per cent significance level, the Treasury Bill rate changes ($x_i$) provide no
significant explanation of AAA rate changes ($y_i$).


(iii)  Comments  on  the  outcomes 

Panel 2 of Exhibit 5.21 indicates that, for WLS, 17 of the $n = 600$ observations
are dropped. This is because in these months $x_i = 0$. This indicates a
shortcoming of the model for the variance, as for $x_i = 0$ the model postulates
that $\mathrm{var}(y_i) = E[\varepsilon_i^2] = \sigma^2 x_i^2 = 0$. In reality this variance is non-zero, as the
AAA  rate  does  not  always  remain  fixed  in  months  where  the  Treasury  Bill 
rate  remains  unchanged  (see  the  histogram  in  Exhibit  5.21  (c)).  In  the  next 
section  we  will  consider  alternative,  less  restrictive  models  for  the  variance  of 
the  disturbances  (see  Example  5.16). 
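
The loss of observations with $x_i = 0$ follows mechanically from the transformation, as the sketch below shows. The data are noise-free and entirely hypothetical (chosen so that the result can be checked by hand), and the function name is an assumption of this illustration.

```python
import numpy as np

def wls_variance_proportional_to_x2(x, y):
    """WLS for y_i = alpha + beta*x_i + eps_i with var(eps_i) = sigma^2 * x_i^2."""
    keep = x != 0                      # dividing by x_i is impossible when x_i = 0
    xs, ys = x[keep], y[keep]
    # Transformed model: y_i/x_i = alpha*(1/x_i) + beta + eps_i/x_i.
    X_star = np.column_stack([1.0 / xs, np.ones(xs.size)])
    y_star = ys / xs
    (alpha, beta), *_ = np.linalg.lstsq(X_star, y_star, rcond=None)
    return alpha, beta, int((~keep).sum())

# Hypothetical noise-free data with two months in which x_i = 0.
x = np.array([0.0, 0.5, -0.25, 0.0, 1.0, -1.0, 0.1])
y = 0.1 + 0.3 * x

alpha_hat, beta_hat, n_dropped = wls_variance_proportional_to_x2(x, y)
print(alpha_hat, beta_hat, n_dropped)
```

Note that the constant term of the original model becomes the coefficient of $1/x_i$ in the transformed model, while the slope becomes the new constant term.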

Exercises: T: 5.5; S: 5.20b-e; E: 5.33c, e.


5.4.4  Estimation  by  maximum  likelihood  and  feasible  WLS 

Maximum  likelihood  in  models  with  heteroskedasticity 

The  application  of  WLS  requires  that  the  variances  of  the  disturbances  are 
known up to a scale factor, that is, $\sigma_i^2 = \sigma^2 v_i$ with $\sigma^2$ an unknown scalar
parameter and with $v_i$ known for all $i = 1, \ldots, n$. If we are not able to specify
such a type of model, then we can use the more general model (5.26) with
variances $\sigma_i^2 = h(z_i'\gamma)$, where $\gamma$ contains $p$ unknown parameters. Under
Assumptions 1, 2, and 4-7, the log-likelihood (in terms of the $(k + p)$ unknown
parameters $\beta$ and $\gamma$) is given by

$$l(\beta, \gamma) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\sum_{i=1}^n \log(h(z_i'\gamma)) - \frac{1}{2}\sum_{i=1}^n \frac{(y_i - x_i'\beta)^2}{h(z_i'\gamma)}. \qquad (5.34)$$

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The  ML  estimators  of  p  and  y  are  obtained  by  maximizing  l(P,  y),  and  these 
estimators  have  the  usual  optimal  asymptotic  properties.  For  given  values  of 
y,  the  optimal  values  of  p  are  obtained  by  WLS,  replacing  v,  in  (5.27)  by 
h(z'j iy),  so  that 


bwLs(y)  =  (Y2~i  =  (5.35) 

This  estimator  is  not  ‘feasible’  —  that  is,  it  cannot  be  computed  because  y  is 
unknown.  Fiowever,  we  can  substitute  this  formula  for  bwLS  in  (5.34)  to 
obtain  the  concentrated  log-likelihood  as  a  function  of  y  alone.  Then  y  can  be 
estimated  by  maximizing  this  concentrated  log-likelihood  and  the  corres¬ 
ponding  estimate  of  ft  follows  from  (5.35). 
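As an illustration, the log-likelihood (5.34) can be evaluated directly for given parameter values. The sketch below is only an assumption-laden example: the function name `loglik`, the list-based data layout, and the default choice $h = \exp$ (the multiplicative model) are introduced here for illustration and are not prescribed by the text.

```python
import math


def loglik(beta, gamma, y, X, Z, h=math.exp):
    """Log-likelihood (5.34) with variances sigma_i^2 = h(z_i' gamma).

    y is a list of observations, X and Z are lists of rows (x_i and z_i).
    """
    n = len(y)
    total = -0.5 * n * math.log(2.0 * math.pi)
    for yi, xi, zi in zip(y, X, Z):
        var_i = h(sum(zj * gj for zj, gj in zip(zi, gamma)))
        resid = yi - sum(xj * bj for xj, bj in zip(xi, beta))
        total -= 0.5 * math.log(var_i) + 0.5 * resid ** 2 / var_i
    return total


# Hypothetical two-observation example with a perfect fit and unit variances.
y = [1.0, 2.0]
X = [[1.0, 0.0], [1.0, 1.0]]
Z = [[1.0], [1.0]]
value = loglik([1.0, 1.0], [0.0], y, X, Z)
```

With zero residuals and unit variances only the constant term $-\frac{n}{2}\log(2\pi)$ remains. A numerical optimizer applied to such a function (or to its concentrated version in $\gamma$) would give the ML estimates.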

Feasible  weighted  least  squares 

An alternative and computationally simpler estimation method is to
use a two-step approach. In the first step the variance parameters $\gamma$ are
estimated, and in the second step the regression parameters $\beta$ are estimated,
using the estimated variances of the first step. This method is called
(two-step) feasible weighted least squares (FWLS).


Two-step  feasible  weighted  least  squares 


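•  Step 1: Estimate the variance parameters. Determine an estimate $c$ of
the variance parameters $\gamma$ in the model $\mathrm{var}(\varepsilon_i) = h(z_i'\gamma)$ and define the
estimated variances by $s_i^2 = h(z_i'c)$.

•  Step 2: Apply WLS with the estimated variances. Compute the feasible
weighted least squares estimates

$$b_{FWLS} = \left(\sum_{i=1}^{n}\frac{1}{s_i^2}x_i x_i'\right)^{-1}\left(\sum_{i=1}^{n}\frac{1}{s_i^2}x_i y_i\right). \tag{5.36}$$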
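The two steps can be sketched as follows for the multiplicative model $h(z_i'\gamma) = e^{z_i'\gamma}$. This is a minimal sketch under stated assumptions: the helper names (`solve`, `wls`, `fwls_multiplicative`) and the simulated data are hypothetical, and step 1 uses the auxiliary regression of $\log(e_i^2)$ on $z_i$. The constant term of the variance equation is not corrected here, because a common scale factor in all $s_i^2$ leaves the WLS estimates in (5.36) unchanged.

```python
import math
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def wls(X, y, var):
    """Weighted least squares with weights 1/var_i (OLS if all var_i are equal)."""
    k = len(X[0])
    A = [[sum(xi[a] * xi[b] / v for xi, v in zip(X, var)) for b in range(k)]
         for a in range(k)]
    c = [sum(xi[a] * yi / v for xi, yi, v in zip(X, y, var)) for a in range(k)]
    return solve(A, c)


def fwls_multiplicative(X, y, Z):
    """Two-step FWLS with var(eps_i) = exp(z_i' gamma)."""
    ones = [1.0] * len(y)
    b_ols = wls(X, y, ones)
    e2 = [(yi - sum(u * v for u, v in zip(xi, b_ols))) ** 2 for xi, yi in zip(X, y)]
    c = wls(Z, [math.log(v) for v in e2], ones)                  # step 1
    s2 = [math.exp(sum(u * v for u, v in zip(zi, c))) for zi in Z]
    return wls(X, y, s2), c                                      # step 2


# Simulated (hypothetical) data: y = 1 + 2x + eps, var(eps) = exp(-1 + 2d).
random.seed(0)
X, Z, y = [], [], []
for _ in range(2000):
    x = random.uniform(0.0, 4.0)
    d = 1.0 if random.random() < 0.5 else 0.0
    sd = math.exp(0.5 * (-1.0 + 2.0 * d))
    X.append([1.0, x])
    Z.append([1.0, d])
    y.append(1.0 + 2.0 * x + random.gauss(0.0, sd))
b_fwls, c = fwls_multiplicative(X, y, Z)
```

In this simulation the FWLS estimates should lie close to the true values (1, 2), and the slope coefficient of the variance equation should lie close to 2, whereas its constant term is biased unless it is corrected.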


Derivation  of  statistical  properties  of  FWLS 

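The properties of the estimator $b_{FWLS}$ depend on those of the estimator $c$ of $\gamma$ used
in step 1. To investigate the consistency and the asymptotic distribution of $b_{FWLS}$,
we write $\Omega_\gamma$ for the $n \times n$ diagonal matrix with elements $\sigma_i^2 = h(z_i'\gamma)$ and $\Omega_c$ for the
$n \times n$ diagonal matrix with elements $s_i^2 = h(z_i'c)$. Writing the model $y_i = x_i'\beta + \varepsilon_i$
in matrix form $y = X\beta + \varepsilon$, we get

$$b_{FWLS} - b_{WLS}(\gamma) = (X'\Omega_c^{-1}X)^{-1}X'\Omega_c^{-1}y - (X'\Omega_\gamma^{-1}X)^{-1}X'\Omega_\gamma^{-1}y$$

$$= (X'\Omega_c^{-1}X)^{-1}X'\Omega_c^{-1}\varepsilon - (X'\Omega_\gamma^{-1}X)^{-1}X'\Omega_\gamma^{-1}\varepsilon.$$

Therefore, $b_{FWLS}$ has the same asymptotic distribution (5.31) as $b_{WLS}$ (and hence it
is consistent and asymptotically efficient) provided that

$$\mathrm{plim}\left(\frac{1}{n}X'(\Omega_c^{-1} - \Omega_\gamma^{-1})X\right) = \mathrm{plim}\left(\frac{1}{n}\sum_{i=1}^{n}\left(\frac{1}{s_i^2} - \frac{1}{\sigma_i^2}\right)x_i x_i'\right) = 0, \tag{5.37}$$

$$\mathrm{plim}\left(\frac{1}{\sqrt{n}}X'(\Omega_c^{-1} - \Omega_\gamma^{-1})\varepsilon\right) = \mathrm{plim}\left(\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\left(\frac{1}{s_i^2} - \frac{1}{\sigma_i^2}\right)x_i\varepsilon_i\right) = 0. \tag{5.38}$$

Under some regularity conditions on the regressors $x_i$ and the function $h$ in (5.26),
the above two conditions are satisfied if $c$ is a consistent estimator of $\gamma$. If $c$ is
consistent, then the FWLS estimator has the same asymptotic covariance matrix as
the WLS estimator. Under conditions (5.37), (5.38), and (5.32), (5.33), we can use
the following result as an approximation in finite samples.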

Approximate  distribution  of  the  FWLS  estimator 

Under the above conditions, in particular consistency of the estimator $c$ of
the variance parameters $\gamma$, there holds

$$b_{FWLS} \approx N\bigl(\beta, (X'\Omega_c^{-1}X)^{-1}\bigr).$$

Here $c$ is the estimate of $\gamma$ obtained in step 1 of FWLS, and $\Omega_c$ is the
corresponding diagonal matrix with the estimated variances $s_i^2 = h(z_i'c)$ on
the diagonal. So the covariance matrix of the FWLS estimator can be
approximated by

$$\widehat{\mathrm{var}}(b_{FWLS}) = (X'\Omega_c^{-1}X)^{-1} = \left(\sum_{i=1}^{n}\frac{1}{s_i^2}x_i x_i'\right)^{-1}.$$

If one wants to use WLS with chosen weighting factors $s_i^2$ but one is uncertain
whether these weights correspond to the actual variances, then the above
formula for the variance is in general no longer correct. In this case consistent
estimates of the standard errors can be obtained by GMM. This corresponds
to the White standard errors of OLS after the observations $(y_i, x_i)$ have been
transformed to $(y_i^*, x_i^*)$, where $y_i^* = y_i/s_i$ and $x_i^* = \frac{1}{s_i}x_i$.
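The difference between the two variance formulas can be made concrete in the simplest possible case. The sketch below assumes a regression through the origin with a single regressor, $y_i = \beta x_i + \varepsilon_i$, so that all matrices reduce to scalars; the function name and the two-observation data set are hypothetical and serve only to show the arithmetic.

```python
def wls_origin_with_variances(x, y, s):
    """WLS for y_i = beta*x_i + eps_i using weighting factors s_i^2.

    Returns (b, conventional variance, White variance), where both
    variances are computed from the transformed data y_i/s_i and x_i/s_i.
    """
    xs = [xi / si for xi, si in zip(x, s)]
    ys = [yi / si for yi, si in zip(y, s)]
    sxx = sum(v * v for v in xs)
    b = sum(u * v for u, v in zip(xs, ys)) / sxx
    e = [v - b * u for u, v in zip(xs, ys)]          # transformed residuals
    var_conv = 1.0 / sxx                             # (X' Omega_c^{-1} X)^{-1}
    var_white = sum((u * r) ** 2 for u, r in zip(xs, e)) / sxx ** 2
    return b, var_conv, var_white


# Hypothetical data: x = (1, 2), y = (1, 3), chosen weights s = (1, 2).
b, var_conv, var_white = wls_origin_with_variances([1.0, 2.0], [1.0, 3.0], [1.0, 2.0])
```

The conventional formula depends only on the chosen weights, whereas the White formula also uses the transformed residuals, so that it remains valid when the weights do not match the actual variances.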


Two-step  FWLS  in  the  additive  and  multiplicative  model 

The foregoing shows that the two-step FWLS estimator is asymptotically
as efficient as WLS, provided that the estimator $c$ in step 1 is a consistent
estimator of $\gamma$. We consider this for the additive and the multiplicative model
for heteroskedasticity. In both cases, first OLS is applied in the model
$y = X\beta + \varepsilon$ with residuals $e$. If $b$ is consistent, the squared residuals $e_i^2$ are
asymptotically unbiased estimates of $\sigma_i^2$. Then $\gamma$ is estimated by replacing $\sigma_i^2$
by $e_i^2$ and by running the regression

$$e_i^2 = z_i'\gamma + \eta_i$$

in the additive model, and in the multiplicative model

$$\log(e_i^2) = z_i'\gamma + \eta_i.$$

The error terms are given by $\eta_i = e_i^2 - \sigma_i^2$ in the additive model and by
$\eta_i = \log(e_i^2/\sigma_i^2)$ in the multiplicative model. It is left as an exercise (see
Exercise 5.6) to show that the above regression for the additive model gives
a consistent estimate of $\gamma$, but that in the multiplicative model a correction
factor is needed. In the latter model the coefficients $\gamma_j$ of the variables $z_j$ are
consistently estimated for $j = 2, \ldots, p$, but the coefficient $\gamma_1$ of the constant
term $z_1 = 1$ should be estimated as $\hat{\gamma}_1 + a$, where $\hat{\gamma}_1$ is the OLS estimate of $\gamma_1$
and $a = -E[\log(\chi^2(1))] \approx 1.27$.
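The value of the correction factor can be checked numerically. The sketch below uses the known result $E[\log(\chi^2(1))] = \psi(1/2) + \log 2 = -\gamma_E - \log 2$, where $\psi$ is the digamma function and $\gamma_E$ is the Euler-Mascheroni constant (not to be confused with the variance parameters $\gamma$), and compares it with a simple simulation. The variable names and the number of draws are choices made for this illustration.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329   # Euler-Mascheroni constant

# a = -E[log chi2(1)] = EULER_GAMMA + log 2
a_exact = EULER_GAMMA + math.log(2.0)

# Simulation check: chi2(1) is the square of a standard normal variable.
random.seed(1)
draws = 200_000
a_sim = -sum(math.log(random.gauss(0.0, 1.0) ** 2) for _ in range(draws)) / draws
```

Both values should be close to 1.27, in line with the correction factor stated above.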

Iterated  FWLS 

Instead of the above two-step FWLS method, we can also apply iterated
FWLS. In this case the FWLS estimate of $\beta$ in step 2 is used to construct the
corresponding series of residuals, which are used again in step 1 to determine
new estimates of the heteroskedasticity parameters $\gamma$. The newly estimated
variances are then used in step 2 to compute the corresponding new FWLS
estimates of $\beta$. This is iterated until the parameter estimates converge. These
iterations can improve the efficiency of the FWLS estimator in finite samples.

Example  5.15:  Bank  Wages  (continued) 

We  consider  the  bank  wage  data  again  and  will  discuss  (i)  a  multiplicative 
model  for  heteroskedasticity,  (ii)  the  two-step  FWLS  estimates  of  this  model, 
and  (iii)  the  ML  estimates. 

(i)  A  multiplicative  model  for  heteroskedasticity 

In Example 5.10 we considered the regression model

$$y_i = \beta_1 + \beta_2 x_i + \beta_3 D_{gi} + \beta_4 D_{mi} + \beta_5 D_{2i} + \beta_6 D_{3i} + \varepsilon_i.$$

We concluded that the unexplained variation $\varepsilon_i$ in the (logarithmic) salaries
may differ among the three job categories (see Exhibit 5.16). Suppose that
the disturbance terms $\varepsilon_i$ in the above regression model have variances $\sigma_1^2$, $\sigma_2^2$,
or $\sigma_3^2$ according to whether the $i$th employee has a job in category 1, 2,
or 3 respectively. Let the parameters be transformed by $\gamma_1 = \log(\sigma_1^2)$,
$\gamma_2 = \log(\sigma_2^2/\sigma_1^2)$, and $\gamma_3 = \log(\sigma_3^2/\sigma_1^2)$; then we can formulate the following
multiplicative model for $\sigma_i^2$:

$$\sigma_i^2 = E[\varepsilon_i^2] = e^{\gamma_1 + \gamma_2 D_{2i} + \gamma_3 D_{3i}}.$$

(ii)  Two-step  FWLS  estimates 

To  apply  (two-step)  FWLS,  the  parameters  of  this  model  for  the  variances  are 
estimated  in  Panels  1  and  2  of  Exhibit  5.22.  In  Panel  2  the  explained  variable 


Panel 1: Dependent Variable: LOGSALARY
Method: Least Squares
Sample: 1 474
Included observations: 474

Variable      Coefficient   Std. Error   t-Statistic   Prob.
C                9.574694     0.054218      176.5965   0.0000
EDUC             0.044192     0.004285      10.31317   0.0000
GENDER           0.178340     0.020962      8.507685   0.0000
MINORITY        -0.074858     0.022459     -3.333133   0.0009
DUMJCAT2         0.170360     0.043494      3.916891   0.0001
DUMJCAT3         0.539075     0.030213      17.84248   0.0000
R-squared        0.760775

Panel 2: Dependent Variable: LOG(RESOLS^2)
Method: Least Squares
Sample: 1 474
Included observations: 474

Variable      Coefficient   Std. Error   t-Statistic   Prob.
C               -4.733237     0.123460     -38.33819   0.0000
DUMJCAT2        -0.289197     0.469221     -0.616335   0.5380
DUMJCAT3         0.460492     0.284800      1.616892   0.1066
R-squared        0.006882

Panel 3: Dependent Variable: LOGSALARY
Method: Least Squares
Sample: 1 474
Included observations: 474
Weighting series: 1/STDEV (v_i = (STDEV_i)^2 obtained from Panel 2)

Variable      Coefficient   Std. Error   t-Statistic   Prob.
C                9.595652     0.052207      183.8011   0.0000
EDUC             0.042617     0.004128      10.32413   0.0000
GENDER           0.178389     0.020391      8.748212   0.0000
MINORITY        -0.077864     0.021358     -3.645626   0.0003
DUMJCAT2         0.166836     0.037321      4.470278   0.0000
DUMJCAT3         0.545375     0.032659      16.69933   0.0000
Weighted Statistics:   R-squared 0.936467
Unweighted Statistics: R-squared 0.760688

Exhibit 5.22 Bank Wages (Example 5.15)

OLS for wage data (Panel 1), step 1 of FWLS (Panel 2, auxiliary regression of OLS residuals for
estimation of the variance parameters in the multiplicative model of heteroskedasticity), and
step 2 of FWLS (Panel 3, WLS with estimated variances obtained from Panel 2).

Panel 4: Dependent Variable: LOGSALARY
Method: Maximum Likelihood (BHHH), multiplicative heteroskedasticity
Sample: 1 474
Included observations: 474
Evaluation order: By observation
Convergence achieved after 76 iterations

Variable        Coefficient   Std. Error   z-Statistic   Prob.
Constant           9.629294     0.053441      180.1839   0.0000
EDUC               0.039782     0.004162      9.559227   0.0000
GENDER             0.182140     0.021259      8.567533   0.0000
MINORITY          -0.072756     0.023355     -3.115197   0.0018
DUMJCAT2           0.155865     0.036379      4.284448   0.0000
DUMJCAT3           0.557101     0.034005      16.38289   0.0000
Variance Equation
Constant (γ1)     -3.342117     0.065795     -50.79576   0.0000
DUMJCAT2 (γ2)     -0.867102     0.259368     -3.343140   0.0008
DUMJCAT3 (γ3)      0.452073     0.173538      2.605030   0.0092
Log likelihood     112.2237

Exhibit 5.22 (Contd.)

ML estimates of model for wages with multiplicative model for heteroskedasticity
(Panel 4, starting values at FWLS estimates).

is $\log(e_i^2)$, where $e_i$ are the OLS residuals of the regression in Panel 1. With
the correction factor for multiplicative models, the variances are estimated
by $s_i^2 = e^{1.27 + \hat{\gamma}_1 + \hat{\gamma}_2 D_{2i} + \hat{\gamma}_3 D_{3i}}$, that is, $s_1^2 = e^{1.27 + \hat{\gamma}_1}$, $s_2^2 = s_1^2 e^{\hat{\gamma}_2}$, and $s_3^2 = s_1^2 e^{\hat{\gamma}_3}$.
The results in Panel 2 give the following estimates of the standard deviations
per job category:

$$s_1 = \sqrt{e^{1.27 - 4.733}} = 0.177, \quad s_2 = s_1\sqrt{e^{-0.289}} = 0.153, \quad s_3 = s_1\sqrt{e^{0.460}} = 0.223.$$

As expected, the standard deviation is smallest for custodial jobs and it
is largest for management jobs. The corresponding (two-step) FWLS estimator
in (5.36) is given in Panel 3 of Exhibit 5.22. The outcomes are quite close
to those of OLS, so that the effect of heteroskedasticity is relatively small.
Moreover, the estimates $\hat{\gamma}_2$ and $\hat{\gamma}_3$ are not significant, indicating that the
homoskedasticity of the error terms need not be rejected.
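The standard deviations per job category follow directly from the coefficients in Panel 2 of Exhibit 5.22 and the correction factor 1.27. The short computation below reproduces this arithmetic; the variable names are chosen for this illustration only.

```python
import math

A_CORRECTION = 1.27                              # correction factor for the constant term
g1, g2, g3 = -4.733237, -0.289197, 0.460492      # Panel 2 of Exhibit 5.22

s1 = math.sqrt(math.exp(A_CORRECTION + g1))      # job category 1
s2 = s1 * math.sqrt(math.exp(g2))                # job category 2
s3 = s1 * math.sqrt(math.exp(g3))                # job category 3
```

Rounded to three decimals this gives 0.177, 0.153, and 0.223, the values reported above.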

(iii)  ML  estimates 

Panel 4 of Exhibit 5.22 shows the results of ML. The ML estimates of the
parameters of the regression equation are close to the (two-step) FWLS
estimates. However, the ML estimates of the variance parameters $\gamma_1$, $\gamma_2$,
and $\gamma_3$ are quite different from those obtained in the (two-step) FWLS
method. In particular, the ML estimates of the parameters $\gamma_2$ and $\gamma_3$ differ
significantly from zero. That is, the ML results indicate significant
heteroskedasticity between the three job categories. As the ML estimates are
efficient, this leads to sharper conclusions than FWLS (where the null
hypothesis of homoskedasticity could not be rejected).

Example  5.16:  Interest  and  Bond  Rates  (continued) 

We continue our analysis of the interest and bond rate data of Example 5.14.
The model $E[\varepsilon_i^2] = \sigma^2 x_i^2$ that was analysed in that example did not turn out to
be very realistic. We will discuss (i) two alternative models for the variance,
(ii) two-step FWLS and ML estimates of both models, and (iii) our conclusion.

(i) Two alternative models for the variance

We consider again the relation between monthly changes in AAA bond rates
($y_i$) and monthly changes in Treasury Bill rates ($x_i$) given by

$$y_i = \alpha + \beta x_i + \varepsilon_i, \quad E[\varepsilon_i^2] = \sigma_i^2.$$

In Section 5.4.3 we considered WLS with the model $\sigma_i^2 = \sigma^2 x_i^2$ for the
variances and we concluded that this model has its shortcomings. Exhibit
5.17 shows that the variance in the period 1950-74 is smaller than that in the
period 1975-99. This can be modelled by

$$\sigma_i^2 = \gamma_1 + \gamma_2 D_i,$$

where $D_i$ is a dummy variable with $D_i = 0$ in the months 50.01-74.12 and
$D_i = 1$ in the months 75.01-99.12. In this model the variance is $\gamma_1$ until 1974
and it becomes $\gamma_1 + \gamma_2$ from 1975 onwards. However, as is clear from Exhibit
5.17 (a), the variance is also changing within these two subperiods. In general,
large residuals tend to be followed by large residuals, and small residuals by
small ones. A model for this kind of clustered variances is given by

$$\sigma_i^2 = \gamma_1 + \gamma_2\varepsilon_{i-1}^2 = \gamma_1 + \gamma_2(y_{i-1} - \alpha - \beta x_{i-1})^2.$$


(ii)  Two-step  FWLS  and  ML  estimates 

Exhibit 5.23 shows the results of two-step FWLS and ML estimates for both
heteroskedasticity models. The two-step FWLS estimates are obtained as
follows. In the first step, $y_i$ is regressed on $x_i$ with residuals
$e_i = y_i - a - bx_i$ (see Panels 1 and 5). In the second step, for the dummy
variable model we perform the regression (see Panel 2)

$$e_i^2 = \gamma_1 + \gamma_2 D_i + \eta_i,$$

and for the model with clustered variances we perform the regression (see
Panel 6)

Panel 1: Dependent Variable: DAAA
Method: Least Squares
Sample: 1950:01 1999:12
Included observations: 600

Variable      Coefficient   Std. Error   t-Statistic   Prob.
C                0.006393     0.006982      0.915697   0.3602
DUS3MT           0.274585     0.014641      18.75442   0.0000

Panel 2: Dependent Variable: RESOLS^2
Method: Least Squares
Sample: 1950:01 1999:12
Included observations: 600

Variable      Coefficient   Std. Error   t-Statistic   Prob.
C                0.009719     0.004374      2.222044   0.0267
DUM7599          0.038850     0.006186      6.280616   0.0000

Panel 3: Dependent Variable: DAAA
Method: Least Squares
Sample: 1950:01 1999:12
Included observations: 600
Weighting series: 1/STDEV (v_i = (STDEV_i)^2 = fitted value of Panel 2)

Variable      Coefficient   Std. Error   t-Statistic   Prob.
C                0.013384     0.005127      2.610380   0.0093
DUS3MT           0.214989     0.014079      15.27018   0.0000

Panel 4: Dependent Variable: DAAA
Method: Maximum Likelihood (BHHH), dummy model heteroskedasticity
Sample: 1950:01 1999:12
Included observations: 600
Convergence achieved after 18 iterations

Variable        Coefficient   Std. Error   z-Statistic   Prob.
Constant           0.014083     0.005036      2.796224   0.0052
DUS3MT             0.205870     0.010699      19.24227   0.0000
Variance Equation
Constant (γ1)      0.008413     0.000393      21.38023   0.0000
DUM7599 (γ2)       0.043714     0.002960      14.76792   0.0000

Exhibit  5.23  Interest  and  Bond  Rates  (Example  5.16) 

OLS  of  AAA  bond  rate  on  Treasury  Bill  rate  (Panel  1),  step  1  of  FWLS  (Panel  2,  auxiliary 
regression  of  squared  residuals  to  estimate  dummy  model  of  heteroskedasticity),  step  2  of 
FWLS  (Panel  3,  WLS  with  estimated  variances  obtained  from  Panel  2),  and  ML  with  dummy 
model  for  heteroskedasticity  (Panel  4). 

$$e_i^2 = \gamma_1 + \gamma_2 e_{i-1}^2 + \eta_i.$$

The estimated variances, $\hat{\sigma}_i^2 = \hat{\gamma}_1 + \hat{\gamma}_2 D_i$ in the first model and
$\hat{\sigma}_i^2 = \hat{\gamma}_1 + \hat{\gamma}_2 e_{i-1}^2$ in the second model, are then used to compute the (two-step)
FWLS estimates (5.36) of $\alpha$ and $\beta$ (see Panels 3 and 7). The results in Panels 4
and 8 of Exhibit 5.23 show that the standard errors of the ML estimates are
smaller than those of the FWLS estimates. For instance, the estimated slope
parameter $\beta$ in the dummy model for the variance has standard errors 0.0107
(ML, see Panel 4) and 0.0141 (FWLS, see Panel 3), and in the model with




Panel 5: Dependent Variable: DAAA
Method: Least Squares
Sample: 1950:01 1999:12
Included observations: 600

Variable      Coefficient   Std. Error   t-Statistic   Prob.
C                0.006393     0.006982      0.915697   0.3602
DUS3MT           0.274585     0.014641      18.75442   0.0000

Panel 6: Dependent Variable: RESOLS^2
Method: Least Squares
Sample(adjusted): 1950:02 1999:12
Included observations: 599 after adjusting endpoints

Variable        Coefficient   Std. Error   t-Statistic   Prob.
C                  0.023025     0.003336      6.901212   0.0000
RESOLS(-1)^2       0.211512     0.039997      5.288181   0.0000

Panel 7: Dependent Variable: DAAA
Method: Least Squares
Sample(adjusted): 1950:02 1999:12
Included observations: 599 after adjusting endpoints
Weighting series: 1/STDEV (v_i = (STDEV_i)^2 = fitted value of Panel 6)

Variable      Coefficient   Std. Error   t-Statistic   Prob.
C                0.008738     0.006354      1.375250   0.1696
DUS3MT           0.284731     0.015628      18.21882   0.0000

Panel 8: Dependent Variable: DAAA
Method: Maximum Likelihood (BHHH), clustered variances
Sample: 1950:01 1999:12
Included observations: 600
Convergence achieved after 14 iterations

Variable           Coefficient   Std. Error   z-Statistic   Prob.
C                     0.013154     0.003749      3.508211   0.0005
DUS3MT                0.246218     0.003566      69.04759   0.0000
Variance Equation
Constant (γ1)         0.010647     0.000647      16.45021   0.0000
ε_{i-1}^2 (γ2)        1.023405     0.101769      10.05619   0.0000

Exhibit 5.23 (Contd.)

OLS of AAA bond rate on Treasury Bill rate (Panel 5), step 1 of FWLS (Panel 6, auxiliary
regression of residuals to estimate model with clustered variances), step 2 of FWLS (Panel 7,
WLS with estimated variances obtained from Panel 6), and ML for model with clustered
variances (Panel 8).

clustered  variances  the  standard  errors  are  0.0036  (ML,  see  Panel  8)  and 
0.0156  (FWLS,  see  Panel  7). 
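Step 1 of FWLS for the model with clustered variances can be sketched as follows. The function names and the artificial residual series are assumptions made for this illustration; the residuals are constructed so that $e_i^2 = 1 + 0.5\,e_{i-1}^2$ holds exactly, which makes the outcome of the auxiliary regression easy to verify. Note that the first observation is lost because of the lag, as in Panels 6 and 7 of Exhibit 5.23.

```python
import math


def ols_simple(x, y):
    """OLS of y on a constant and x; returns (intercept, slope)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def clustered_variance_step1(e):
    """Regress e_i^2 on a constant and e_{i-1}^2; returns (g1, g2)."""
    e2 = [v * v for v in e]
    return ols_simple(e2[:-1], e2[1:])


def fitted_variances(e, g1, g2):
    """Estimated variances g1 + g2*e_{i-1}^2 for observations 2, ..., n."""
    return [g1 + g2 * v * v for v in e[:-1]]


# Artificial residuals with e_i^2 = 1 + 0.5 * e_{i-1}^2 and alternating signs.
e2_path = [1.0]
for _ in range(4):
    e2_path.append(1.0 + 0.5 * e2_path[-1])
e = [(-1) ** i * math.sqrt(v) for i, v in enumerate(e2_path)]
g1, g2 = clustered_variance_step1(e)
s2 = fitted_variances(e, g1, g2)
```

The fitted variances would then serve as weights in step 2. With real residuals the fitted values are not guaranteed to be positive, which should be checked before they are used as variances.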

(iii)  Conclusion 

A  natural  question  is  which  model  for  the  variance  should  be  preferred.  To 
answer  this  question  we  should  test  the  validity  of  the  specified  models  for 
heteroskedasticity.  This  is  further  analysed  in  Example  5.18  at  the  end  of  the 
next  section. 


Exercises: T: 5.6a, b; E: 5.25a, 5.28a-c.

5.4.5 Tests for homoskedasticity

Motivation  of  diagnostic  tests  for  heteroskedasticity 

When  heteroskedasticity  is  present,  ML  and  FWLS  will  in  general  offer  a  gain 
in  efficiency  as  compared  to  OLS.  But  efficiency  is  lost  if  the  disturbances  are 
homoskedastic.  In  order  to  decide  which  estimation  method  to  use,  we  first 
have  to  test  for  the  presence  of  heteroskedasticity. 

It is often helpful to make plots of the least squares residuals $e_i$ and
their squares $e_i^2$ as well as scatters of these variables against explanatory
variables $x_i$ or against the fitted values $\hat{y}_i = x_i'b$. This may provide
a first indication of deviations from homoskedastic error terms. Diagnostic
tests like the CUSUMSQ discussed in Section 5.3.3 are also helpful. Further,
if the disturbances in the model $y_i = x_i'\beta + \varepsilon_i$ are heteroskedastic and a model
$\sigma_i^2 = h(z_i'\gamma)$ has been postulated, then it is of interest to test whether this
model for the variances is adequately specified. Let $\hat{\beta}$ be the (ML or FWLS)
estimate of $\beta$ with corresponding residuals $e_i = y_i - x_i'\hat{\beta}$ and let $\hat{\gamma}$ be the
estimate of $\gamma$ and $\hat{\sigma}_i^2 = h(z_i'\hat{\gamma})$. Then the standardized residuals $e_i/\hat{\sigma}_i$ should
be (approximately) homoskedastic.

In  this  section  we  discuss  some  tests  for  homoskedasticity  —  that  is, 
Goldfeld-Quandt,  Likelihood  Ratio,  Breusch-Pagan,  and  White. 


The  Goldfeld-Quandt  test 

The Goldfeld-Quandt test requires that the data can be ordered with
non-decreasing variance. The null hypothesis is that the variance is constant for
all observations, and the alternative is that the variance increases. To test this
hypothesis, the ordered data set is split into three groups. The first group
consists of the first $n_1$ observations (with variance $\sigma_1^2$), the second group of
the last $n_2$ observations (with variance $\sigma_2^2$), and the third group of the
remaining $n_3 = n - n_1 - n_2$ observations in the middle. This last group is
left out of the analysis, to obtain a sharper contrast between the variances in
the first and second group. The null and alternative hypotheses are

$$H_0: \sigma_1^2 = \sigma_2^2, \quad H_1: \sigma_2^2 > \sigma_1^2.$$

Now OLS is applied in groups 1 and 2 separately, with resulting sums of
squared residuals $SSR_1$ and $SSR_2$ respectively and estimated variances
$s_1^2 = SSR_1/(n_1 - k)$ and $s_2^2 = SSR_2/(n_2 - k)$. Under the standard
Assumptions 1-7 (in particular, independently and normally distributed error
terms), $SSR_j/\sigma_j^2$ follows the $\chi^2(n_j - k)$ distribution for $j = 1, 2$, and these
two statistics are independent. Therefore

$$\frac{SSR_2/\bigl((n_2 - k)\sigma_2^2\bigr)}{SSR_1/\bigl((n_1 - k)\sigma_1^2\bigr)} = \frac{s_2^2/\sigma_2^2}{s_1^2/\sigma_1^2} \sim F(n_2 - k, n_1 - k).$$

So, under the null hypothesis of equal variances, the test statistic

$$F = s_2^2/s_1^2$$

follows the $F(n_2 - k, n_1 - k)$ distribution. The null hypothesis is rejected in
favour of the alternative if $F$ takes large values. There exists no generally
accepted rule to choose the number $n_3$ of excluded middle observations. If
the variance changes only at a single break-point, then it would be optimal to
select the two groups accordingly and to take $n_3 = 0$. On the other hand, if
nearly all variances are equal and only a few first observations have smaller
variance and a few last ones have larger variance, then it would be best to
take $n_3$ large. In practice one uses rules of thumb, for example, $n_3 = n/5$ if
the sample size $n$ is small and $n_3 = n/3$ if $n$ is large.
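The computation of the Goldfeld-Quandt statistic can be sketched as follows for a regression with a constant and one explanatory variable ($k = 2$). The function names and the simulated data are assumptions made for this example; the data are generated in an order of non-decreasing variance and $n_3 = n/5$ middle observations are left out.

```python
import random


def ssr_simple(x, y):
    """Sum of squared OLS residuals of y on a constant and x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return sum((yi - my - slope * (xi - mx)) ** 2 for xi, yi in zip(x, y))


def goldfeld_quandt(x, y, n1, n2, k=2):
    """F = s2^2/s1^2 from the first n1 and the last n2 (ordered) observations.

    Returns the statistic and its degrees of freedom (n2 - k, n1 - k).
    """
    s1_sq = ssr_simple(x[:n1], y[:n1]) / (n1 - k)
    s2_sq = ssr_simple(x[-n2:], y[-n2:]) / (n2 - k)
    return s2_sq / s1_sq, (n2 - k, n1 - k)


# Simulated (hypothetical) data: standard deviation 1 for the first 50
# observations and 5 for the last 50 observations.
random.seed(2)
n = 100
x = [random.uniform(0.0, 10.0) for _ in range(n)]
sd = [1.0] * 50 + [5.0] * 50
y = [1.0 + 0.5 * xi + random.gauss(0.0, s) for xi, s in zip(x, sd)]
F, df = goldfeld_quandt(x, y, 40, 40)
```

The statistic is then compared with the critical value of the $F(n_2 - k, n_1 - k)$ distribution, which is not computed in this sketch.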

Likelihood  Ratio  test 

Sometimes the data can be split into several groups where the variance is
assumed to be constant within groups and to vary between groups. If there
are $G$ groups and $\sigma_j^2$ denotes the variance in group $j$, then the null hypothesis
of homoskedasticity is

$$H_0: \sigma_1^2 = \sigma_2^2 = \cdots = \sigma_G^2,$$

and the alternative is that this restriction does not hold true. It is left
as an exercise (see Exercise 5.6) to show that, under the standard
Assumptions 1-7, the Likelihood Ratio test for the above hypothesis is
given by

$$LR = n\log(s_{ML}^2) - \sum_{j=1}^{G} n_j\log(s_{j,ML}^2) \approx \chi^2(G - 1). \tag{5.39}$$

Here $s_{ML}^2 = e'e/n$ is the estimated variance over the full data set (that is,
under the null hypothesis of homoskedasticity) and $s_{j,ML}^2 = e_j'e_j/n_j$ is the
estimated variance in group $j$ (obtained by a regression over the $n_j$ observations
in this group).
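Once the variance estimates are available, the statistic (5.39) is a matter of simple arithmetic. The sketch below uses a hypothetical function name and made-up group sizes and variances, chosen only to show that the statistic is zero when all estimated variances coincide and positive otherwise.

```python
import math


def lr_equal_variances(group_sizes, group_vars_ml, pooled_var_ml):
    """Likelihood Ratio statistic (5.39) for equal variances across groups."""
    n = sum(group_sizes)
    return n * math.log(pooled_var_ml) - sum(
        nj * math.log(vj) for nj, vj in zip(group_sizes, group_vars_ml)
    )


# Made-up inputs for G = 2 groups.
lr_equal = lr_equal_variances([30, 70], [2.0, 2.0], 2.0)
lr_unequal = lr_equal_variances([50, 50], [1.0, 4.0], 2.5)
degrees_of_freedom = 2 - 1     # G - 1
```

In the second case the statistic is about 22.3, far above the 5 per cent critical value 3.84 of the $\chi^2(1)$ distribution.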

Breusch-Pagan LM-test

The Breusch-Pagan test is based on models of the type $\sigma_i^2 = h(z_i'\gamma)$ for the
variances, with variables $z_i = (1, z_{2i}, \ldots, z_{pi})'$ that explain the differences in
the variances. The null hypothesis of constant variance corresponds to the
$(p - 1)$ parameter restrictions

$$\gamma_2 = \cdots = \gamma_p = 0.$$

The Breusch-Pagan test is equal to the LM-test

$$LM = \left(\frac{\partial l}{\partial\theta}\right)'\left(-\frac{\partial^2 l}{\partial\theta\,\partial\theta'}\right)^{-1}\left(\frac{\partial l}{\partial\theta}\right).$$

To compute this test we should calculate the first and second order derivatives
of the (unrestricted) log-likelihood (5.34) with respect to the parameters
$\theta = (\beta, \gamma)$, which are then evaluated at the estimated parameter values under
the null hypothesis. It is left as an exercise (see Exercise 5.7) to show that this
leads to the following three-step procedure to compute the Breusch-Pagan
test for heteroskedasticity.


Breusch-Pagan  test  for  heteroskedasticity 

•  Step 1: Apply OLS. Apply OLS in the model $y = X\beta + \varepsilon$ and compute the
residuals $e = y - Xb$.

•  Step 2: Perform auxiliary regression. If the variances $\sigma_i^2$ are possibly
affected by the $(p - 1)$ variables $(z_{2i}, \ldots, z_{pi})$, then apply OLS in the
auxiliary regression equation

$$e_i^2 = \gamma_1 + \gamma_2 z_{2i} + \cdots + \gamma_p z_{pi} + \eta_i. \tag{5.40}$$

•  Step 3: LM = $nR^2$ of the regression in step 2. Then $LM = nR^2$ where $R^2$ is
the coefficient of determination of the auxiliary regression in step 2. This is
asymptotically distributed as $\chi^2(p - 1)$ under the null hypothesis of
homoskedasticity.
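The three steps above can be sketched as follows for a regression with a constant and one explanatory variable and with a single variable $z_i$ in the variance equation ($p - 1 = 1$). The function names and the simulated data, in which the standard deviation of the disturbance equals $x_i$, are assumptions made for this illustration.

```python
import random


def ols_fit(x, y):
    """OLS of y on a constant and x; returns (intercept, slope, R-squared)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope, sxy ** 2 / (sxx * syy)


def breusch_pagan(x, y, z):
    """LM = n * R^2 of the regression of squared OLS residuals on z."""
    a, b, _ = ols_fit(x, y)                                   # step 1
    e2 = [(yi - a - b * xi) ** 2 for xi, yi in zip(x, y)]
    _, _, r2 = ols_fit(z, e2)                                 # step 2
    return len(y) * r2                                        # step 3


# Simulated (hypothetical) heteroskedastic data with var(eps_i) = x_i^2.
random.seed(3)
x = [random.uniform(1.0, 5.0) for _ in range(500)]
y = [1.0 + 2.0 * xi + random.gauss(0.0, xi) for xi in x]
lm = breusch_pagan(x, y, [xi ** 2 for xi in x])
```

With one variable in the variance equation the statistic is compared with the $\chi^2(1)$ distribution, of which the 5 per cent critical value is 3.84.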


White  test 

An advantage of the Breusch-Pagan test is that the function $h$ in the model
(5.26) may be left unspecified. However, one should know the variables
$z_j$ ($j = 2, \ldots, p$) that influence the variance. If these variables are unknown,
then one can replace the variables $z_j$ ($j = 2, \ldots, p$) by functions of the
explanatory variables $x$, for instance, $x_{2i}, \ldots, x_{ki}$ and $x_{2i}^2, \ldots, x_{ki}^2$, in which
case $p - 1 = 2k - 2$. The above LM-test with this particular choice of the
variables $z$ is called the White test (without cross terms). An extension is the
White test with cross terms, where all cross products $x_{ji}x_{hi}$ with $j \neq h$ are also
included as $z$-variables.
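The construction of the $z$-variables for the White test can be sketched as follows. The function name and the example row are hypothetical; the input is assumed to contain the non-constant regressors $(x_{2i}, \ldots, x_{ki})$ of one observation.

```python
from itertools import combinations


def white_regressors(row, cross_terms=False):
    """Variables for the White auxiliary regression of one observation.

    row holds the non-constant regressors (x_2i, ..., x_ki). The result
    contains the levels, the squares and, optionally, the cross products.
    """
    z = list(row) + [v * v for v in row]
    if cross_terms:
        z += [row[j] * row[h] for j, h in combinations(range(len(row)), 2)]
    return z


# Example with k = 3 (a constant and two regressors), so p - 1 = 2k - 2 = 4.
z_plain = white_regressors([2.0, 3.0])
z_cross = white_regressors([2.0, 3.0], cross_terms=True)
```

If a regressor is a dummy variable, its square coincides with the variable itself and should be left out of the auxiliary regression to avoid perfect collinearity.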
Remarks on choice and interpretation of tests

If one can identify variables $z_j$ for the model for the variances that are based
on plausible economic assumptions, then the corresponding test of Breusch
and Pagan is preferred. If homoskedasticity is rejected, then the variances can
be modelled in terms of the variables $z_j$. After the model has been estimated
by taking this type of heteroskedasticity into account, one can test whether
the standardized residuals are homoskedastic. If not, one can try to find a
better model for the variances.

Although the above tests were originally developed to test for
heteroskedasticity, they can also be considered more generally as
misspecification tests. For example, in the White test a significant correlation
between the squared OLS residuals $e_i^2$ and the squares and cross products
of explanatory variables may be caused by misspecification of the
functional form. The hypothesis of homoskedastic error terms may also be
rejected because of the presence of outliers. This is further discussed in
Section 5.6.


Example  5.17:  Bank  Wages  (continued) 

We continue our analysis of the bank wage data (see Example 5.15). We will
discuss (i) the Goldfeld-Quandt test, (ii) the Breusch-Pagan test, (iii) the
Likelihood Ratio test, and (iv) tests for grouped data.

(i) Goldfeld-Quandt test

We apply tests on homoskedasticity for the Bank Wage data. Using the
notation of Example 5.10, the model is given by

$$y_i = \beta_1 + \beta_2 x_i + \beta_3 D_{gi} + \beta_4 D_{mi} + \beta_5 D_{2i} + \beta_6 D_{3i} + \varepsilon_i.$$

For the Goldfeld-Quandt test we perform three regressions, one for each
job category (see Panels 2-4 of Exhibit 5.24). The two job category dummies
$D_2$ and $D_3$ should, of course, be dropped in these regressions. For the second
job category the gender dummy $D_g$ also has to be deleted from the model,
as this subsample consists of males only. Because the results in job category 2
are not significant, possibly owing to the limited number of observations
within this group, we will leave them out and test the null hypothesis
that $\sigma_1^2 = \sigma_3^2$ against the alternative that $\sigma_3^2 > \sigma_1^2$. Using the results in
Panels 2 and 4 of Exhibit 5.24, the corresponding test is computed
as $F = (0.227/0.188)^2 = 1.46$, and this has the $F(n_2 - k, n_1 - k) =
F(84 - 4, 363 - 4) = F(80, 359)$ distribution. The corresponding P-value is
0.011, which indicates that the variance in the third job category is larger
than that in the first job category.

Panel 1: Dependent Variable: LOGSALARY
Method: Least Squares; Sample: 1 474; Included observations: 474

Variable             Coefficient   Std. Error   t-Statistic   Prob.
C                       9.574694     0.054218      176.5965   0.0000
EDUC                    0.044192     0.004285      10.31317   0.0000
GENDER                  0.178340     0.020962      8.507685   0.0000
MINORITY               -0.074858     0.022459     -3.333133   0.0009
DUMJCAT2                0.170360     0.043494      3.916891   0.0001
DUMJCAT3                0.539075     0.030213      17.84248   0.0000
R-squared               0.760775
S.E. of regression      0.195374

Panel 2: Dependent Variable: LOGSALARY
Method: Least Squares; Sample: JOBCAT=1; Included observations: 363

Variable             Coefficient   Std. Error   t-Statistic   Prob.
C                       9.556421     0.056544      169.0083   0.0000
EDUC                    0.046360     0.004494      10.31572   0.0000
GENDER                  0.169221     0.021275      7.954113   0.0000
MINORITY               -0.098557     0.023313     -4.227561   0.0000
R-squared               0.418977
S.E. of regression      0.188190

Panel 3: Dependent Variable: LOGSAL
Method: Least Squares; Sample: JOBCAT=2; Included observations: 27

Variable             Coefficient   Std. Error   t-Statistic   Prob.
C                       10.39388     0.067739      153.4409   0.0000
EDUC                   -0.004634     0.006319     -0.733397   0.4704
MINORITY               -0.019166     0.027543     -0.695845   0.4932
R-squared               0.039055
S.E. of regression      0.071427

Panel 4: Dependent Variable: LOGSAL
Method: Least Squares; Sample: JOBCAT=3; Included observations: 84

Variable             Coefficient   Std. Error   t-Statistic   Prob.
C                       9.675982     0.274004      35.31327   0.0000
EDUC                    0.066967     0.016525      4.052588   0.0001
GENDER                  0.211185     0.080797      2.613780   0.0107
MINORITY                0.260611     0.119540      2.180112   0.0322
R-squared               0.308942
S.E. of regression      0.227476

Panel 5: Dependent Variable: RES^2
Method: Least Squares; Sample: 1 474; Included observations: 474

Variable             Coefficient   Std. Error   t-Statistic   Prob.
C                       0.035166     0.003427      10.26280   0.0000
DUMJCAT2               -0.018265     0.013023     -1.402511   0.1614
DUMJCAT3                0.020103     0.007904      2.543209   0.0113
R-squared               0.019507

Exhibit  5.24  Bank  Wages  (Example  5.17) 

Regression  for  wage  data  of  all  employees  (Panel  1)  and  for  the  three  job  categories  separately 
(Panel  2  for  category  1,  Panel  3  for  category  2,  and  Panel  4  for  category  3),  and  Breusch-Pagan 
test  (Panel  5,  RES  denotes  the  residuals  of  Panel  1). 
(ii) Breusch-Pagan test

The Breusch-Pagan test for the multiplicative model
$\sigma_i^2 = e^{\gamma_1 + \gamma_2 D_{2i} + \gamma_3 D_{3i}}$ can
be computed from the regression in Panel 5 of Exhibit 5.24. Note that in step
2 of the Breusch-Pagan test the dependent variable is $e_i^2$, not
$\log(e_i^2)$. The explained variable in this regression consists of the
squared OLS residuals of the above regression model for the wages. The test
result for the hypothesis that $\gamma_2 = \gamma_3 = 0$ is
$LM = nR^2 = 474(0.0195) = 9.24$. With the $\chi^2(2)$ distribution, the
(asymptotic) P-value is 0.010. This again indicates that the
hypothesis of homoskedastic error terms should be rejected.
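The arithmetic of this test can be retraced with a short Python sketch. It is only an illustration: the variable names are chosen here, the R-squared is the unrounded value reported in Panel 5, and the P-value uses the fact that the upper tail probability of a chi-squared variable with 2 degrees of freedom equals exp(-x/2).

```python
import math

def chi2_upper_tail_2df(x):
    """Upper tail probability of a chi-squared(2) variable: exp(-x/2)."""
    return math.exp(-x / 2.0)

n = 474                 # number of observations in Panel 5
r_squared = 0.019507    # R-squared of the auxiliary regression in Panel 5

lm = n * r_squared                  # LM = n * R^2
p_value = chi2_upper_tail_2df(lm)   # two restrictions: gamma_2 = gamma_3 = 0

print(f"LM = {lm:.3f}, P-value = {p_value:.4f}")
```

With the unrounded R-squared the statistic comes out marginally above the 9.24 quoted in the text, which is based on the rounded value 0.0195; the conclusion is the same.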

(iii) Likelihood Ratio test

The Likelihood Ratio test (5.39) for equal variances in the three job
categories can also be computed from the results in Exhibit 5.24. For each
regression in the exhibit, the standard error of regression ($s$) is computed
by least squares, and $s_{ML}^2$ can then be computed by
$s_{ML}^2 = \frac{n-k}{n}s^2$. For the regression over the full sample with
$n = 474$ in Panel 1, this gives $s = 0.195$ and
$s_{ML}^2 = \frac{468}{474}s^2 = 0.0377$. In a similar way, using the results
for the three job categories in Panels 2-4 of Exhibit 5.24, we obtain
$s_{1,ML}^2 = 0.0350$, $s_{2,ML}^2 = 0.0045$, and $s_{3,ML}^2 = 0.0493$.
With these values, the LR-test is computed as
$LR = 474\log(0.0377) - 363\log(0.0350) - 27\log(0.0045) - 84\log(0.0493) = 61.2$.
With the (asymptotic) $\chi^2(2)$ distribution the P-value
is $P = 0.000$, so that homoskedasticity is again rejected.
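As a check on these numbers, the following Python sketch recomputes the LR statistic from the sample sizes, numbers of regressors, and standard errors of regression reported in Panels 1-4 of Exhibit 5.24. The helper name `s2_ml` is chosen here for illustration; the P-value again uses the chi-squared(2) tail exp(-x/2).

```python
import math

def s2_ml(n, k, s):
    """ML variance estimate from the least squares standard error s."""
    return (n - k) / n * s ** 2

# (observations, regressors, S.E. of regression) as reported in Exhibit 5.24
full_sample = (474, 6, 0.195374)            # Panel 1
job_categories = [(363, 4, 0.188190),       # Panel 2
                  (27, 3, 0.071427),        # Panel 3
                  (84, 4, 0.227476)]        # Panel 4

n, k, s = full_sample
lr = n * math.log(s2_ml(n, k, s)) - sum(
    nj * math.log(s2_ml(nj, kj, sj)) for nj, kj, sj in job_categories)
p_value = math.exp(-lr / 2.0)               # chi-squared(2) upper tail

print(f"LR = {lr:.1f}, P-value = {p_value:.2e}")
```

Working from the unrounded standard errors avoids the small rounding error that arises when the four-decimal variances are inserted directly.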

(iv)  Tests  for  grouped  data 

Next we consider the data obtained after grouping, as described in Example
5.13. The result of estimating the above regression model for the grouped
data was given in Panel 1 of Exhibit 5.20, and is repeated in Panel 1 of
Exhibit 5.25. Panel 2 of Exhibit 5.25 shows the corresponding White test for
homoskedasticity. Note that the square of a dummy variable is equal to that
dummy variable, so that the squares of dummies are not included as
explanatory variables in the White test. The outcome of this test
($nR^2 = 5.45$ with P-value 0.49) does not lead to the rejection
of homoskedasticity. However, if we use the model
$\sigma_j^2 = \sigma^2/n_j$ then the Breusch-Pagan test in Panel 3 of
Exhibit 5.25 gives a value of $LM = nR^2 = 26 \cdot 0.296 = 7.69$ with
P-value 0.006. This leads to rejection of homoskedasticity. This test relates
the variance directly to the inverse of the group size.

The foregoing results illustrate the importance of using all the available
information on the variances of the disturbance terms. This also becomes
evident in the scatter plot in Exhibit 5.25 (d), which relates the squared OLS
residuals to the inverse of the group sizes $1/n_j$.
Panel 1: Dependent Variable: MEANLOGSAL
Method: Least Squares
Sample(adjusted): 1 26
Included observations: 26

Variable        Coefficient   Std. Error   t-Statistic   Prob.
C                  9.673440     0.141875      68.18272   0.0000
MEANEDUC           0.033592     0.010022      3.351783   0.0032
GENDER             0.249522     0.074784      3.336567   0.0033
MINORITY          -0.024444     0.062942     -0.388348   0.7019
DUMJCAT2           0.019526     0.090982      0.214610   0.8322
DUMJCAT3           0.675614     0.084661      7.980253   0.0000

Panel 2: White Heteroskedasticity Test:

F-statistic       0.839570    Probability   0.554832
Obs*R-squared     5.448711    Probability   0.487677

Test Equation:
Dependent Variable: RES^2
Method: Least Squares
Sample: 1 26
Included observations: 26

Variable        Coefficient   Std. Error   t-Statistic   Prob.
C                 -0.037059     0.112378     -0.329770   0.7452
MEANEDUC           0.002928     0.018407      0.159088   0.8753
MEANEDUC^2        -1.30E-05     0.000720     -0.018086   0.9858
GENDER             0.009429     0.018527      0.508896   0.6167
MINORITY           0.009845     0.015601      0.631018   0.5355
DUMJCAT2           0.019932     0.022435      0.888406   0.3854
DUMJCAT3           0.019311     0.020769      0.929819   0.3641

R-squared            0.209566

Panel 3: Dependent Variable: RES^2
Method: Least Squares
Sample(adjusted): 1 26
Included observations: 26 after adjusting endpoints

Variable        Coefficient   Std. Error   t-Statistic   Prob.
C                  0.002398     0.008236      0.291213   0.7734
1/GROUPSIZE        0.059922     0.018879      3.173942   0.0041

R-squared            0.295649

(d) 


Exhibit  5.25  Bank  Wages  (Example  5.17) 


Regression  for  grouped  wage  data  (Panel  1),  White  heteroskedasticity  test  (Panel  2,  RES  are 
the  residuals  of  Panel  1),  and  Breusch-Pagan  test  for  heteroskedasticity  related  to  group  size 
(Panel  3)  with  scatter  diagram  of  squared  residuals  against  inverse  of  group  size  (d). 
Example 5.18: Interest and Bond Rates (continued)

We continue our previous analysis of the interest and bond rate data (see
Examples 5.14 and 5.16). We will discuss (i) heteroskedasticity tests based
on different models, (ii) evaluation of the obtained results, and (iii) our
conclusion.

(i) Tests on heteroskedasticity based on different models

We consider again the model $y_i = \alpha + \beta x_i + \varepsilon_i$ for
the relation between the monthly changes in the AAA bond rate ($y_i$) and the
monthly changes in the three-month Treasury Bill rate ($x_i$). In the
foregoing we considered different possible models for the variances
$\sigma_i^2$ of the disturbances, that is, (i)
$\sigma_i^2 = \sigma^2 x_i^2$, (ii) $\sigma_i^2 = \gamma_1 + \gamma_2 D_i$,
where $D_i$ is a dummy variable for the second half (1975-99) of the
considered time period, and (iii) the clustered variance model.

Now we use these models to test for the presence of heteroskedasticity. For
the models (ii) and (iii) this can be done by testing whether $\gamma_2$
differs significantly from zero. The results in Panels 4 and 8 of Exhibit 5.23
show that the null hypothesis of homoskedastic disturbances is rejected for
both models ($P = 0.000$). Exhibit 5.26 shows the result of the White test.
The P-value of this test is $P = 0.046$ (see Panel 2). At 5 per cent
significance we still reject the null hypothesis of homoskedasticity, but the
tests based on the explicit models (ii) and (iii) have smaller P-values.


Panel 1: Dependent Variable: DAAA
Method: Least Squares
Sample: 1950:01 1999:12
Included observations: 600

Variable        Coefficient   Std. Error   t-Statistic   Prob.
C                  0.006393     0.006982      0.915697   0.3602
DUS3MT             0.274585     0.014641      18.75442   0.0000


Panel 2: White Heteroskedasticity Test:

F-statistic       3.106338    Probability   0.045489
Obs*R-squared     6.179588    Probability   0.045511

Test Equation:
Dependent Variable: RES^2
Method: Least Squares
Sample: 1950:01 1999:12
Included observations: 600

Variable        Coefficient   Std. Error   t-Statistic   Prob.
C                  0.027654     0.003246      8.518663   0.0000
DUS3MT            -0.000224     0.007073     -0.031639   0.9748
DUS3MT^2           0.006560     0.002804      2.339087   0.0197

R-squared            0.010299
Mean dependent var   0.029144

Exhibit  5.26  Interest  and  Bond  Rates  (Example  5.18) 

OLS  (Panel  1)  and  White  heteroskedasticity  test  (Panel  2)  for  regression  of  changes  in  AAA 
bond  rate  on  changes  in  Treasury  Bill  rate  (RES  in  Panel  2  denotes  the  residuals  of  the 
regression  in  Panel  1). 
Exhibit 5.27 Interest and Bond Rates (Example 5.18)


Time plots of standardized residuals of AAA bond rate data, for OLS (STRESOLS (a)), for
model with variance proportional to the square of DUS3MT (STRES1 (b), and (c) shows the
scatter diagram of these standardized residuals against DUS3MT), for dummy variance model
(STRES2 (d)), and for clustered variance model (STRES3 (e)).
(ii) Evaluation of the results

It is of interest to compare the success of the models (i), (ii), and (iii)
in removing the heteroskedasticity. For this purpose we compute the
standardized residuals of the three models, that is,
$(y_i - a - bx_i)/\hat{\sigma}_i$. Plots of the standardized residuals are in
Exhibit 5.27 (b, d, e), together with the plot of the standardized OLS
residuals $e_i/s$ in (a). This shows that model (i) has some very large
standardized residuals, corresponding to observations in months where $x_i$
is close to zero, see Exhibit 5.27 (c). Such observations get an excessively
large weight. The standardized residuals of models (ii) and (iii) still show
some changes in the variance, but somewhat less than the OLS residuals.

(iii)  Conclusion 

The  overall  conclusion  is  that  the  models  considered  here  are  not  able  to 
describe  the  relation  between  AAA  bond  rates  and  the  Treasury  Bill  rate  over 
the  time  span  1950-99.  This  means  that  we  should  either  consider  less 
simplistic  models  or  restrict  the  attention  to  a  shorter  time  period.  We  will 
return  to  these  data  in  Chapter  7,  where  we  discuss  the  modelling  of  time 
series  data  in  more  detail. 

Exercises:  T:  5.6c,  5.7;  E:  5.25b-e,  5.31c. 


5.4.6  Summary 

If  the  error  terms  in  a  regression  model  are  heteroskedastic,  this  means 
that  some  observations  are  more  informative  than  others  for  the  under¬ 
lying  relation.  Efficient  estimation  requires  that  the  more  informative 
observations  get  a  relatively  larger  weight  in  estimation.  One  can  proceed 
as  follows. 

•  Apply  a  test  for  the  possible  presence  of  heteroskedasticity.  If  one  has  an 
idea  what  are  the  possible  causes  of  heteroskedasticity,  it  is  helpful  to 
formulate  a  corresponding  model  for  the  variance  of  the  error  terms  and 
to  apply  the  Breusch-Pagan  test.  If  one  has  no  such  ideas,  one  can  apply 
the  White  test  or  the  Goldfeld-Quandt  test. 

•  If  tests  indicate  the  presence  of  significant  heteroskedasticity,  then  OLS 
should  not  be  routinely  applied,  as  it  is  not  efficient  and  the  usual 
formulas  for  the  standard  errors  (as  computed  by  software  packages) 
do  not  apply.  If  one  sees  no  possibility  of  formulating  a  meaningful 
model  for  the  heteroskedasticity,  then  one  can  apply  GMM  —  that  is, 
OLS  with  White  standard  errors. 
•  If one can formulate a model for the heteroskedasticity (for instance, an
additive or a multiplicative model), then the model parameters can be
estimated by weighted least squares if the variances are known up to a
scale factor. Otherwise one can use feasible weighted least squares or
maximum likelihood, with the usual approximate distributions of the
estimators.

•  Let $e_i = y_i - x_i'\hat{\beta}$ be the $i$th residual and let
$\hat{\sigma}_i^2$ be the estimated variance of the $i$th disturbance. Then
the model for heteroskedasticity may be evaluated by checking whether the
scaled residuals $e_i/\hat{\sigma}_i$ are homoskedastic. If this is not the
case, one can try to improve the model for the variances, or otherwise apply
OLS with White standard errors.
5.5 Serial correlation


5.5.1 Introduction

Interpretation of serial correlation

As before, let the relation between the dependent variable $y$ and the
independent variables $x$ be specified by

$$y_i = x_i'\beta + \varepsilon_i, \quad i = 1, \ldots, n. \tag{5.41}$$

The disturbances are said to be serially correlated if there exist
observations $i \neq j$ so that $\varepsilon_i$ and $\varepsilon_j$ have a
non-zero correlation. In this case the covariance matrix $\Omega$ is not
diagonal. This means that, apart from the systematic parts modelled by
$x_i'\beta$ and $x_j'\beta$, the observations $y_i$ and $y_j$ have something
more in common.

In general, the purpose of (5.41) is to model all systematic factors that
influence the dependent variable $y$. If the error terms are serially
correlated, this means that the model is not successful in this respect. One
should then try to detect the possible causes for serial correlation and, if
possible, to adjust the model so that its disturbances become uncorrelated.
For example, it may be that in (5.41) an important independent variable is
missing (omitted variables), or that the functional relationship is
non-linear instead of linear (functional misspecification), or that lagged
values of the dependent or independent variables should be included as
explanatory variables (neglected dynamics). We illustrate this by two
examples.
Example 5.19: Interest and Bond Rates (continued)

We continue our analysis of the interest and bond rate data and will discuss
(i) graphical evidence of serial correlation for these data, and (ii) an
economic interpretation of this serial correlation.

(i) Graphical evidence for serial correlation

We consider the linear model

$$y_i = \alpha + \beta x_i + \varepsilon_i, \quad i = 1, \ldots, 600,$$

for the monthly changes $y_i$ in the AAA bond rate and the monthly changes
$x_i$ in the three-month Treasury Bill rate. The sample period runs from
January 1950 to December 1999. Exhibit 5.28 shows graphs of the series of
residuals $e_i$ over the whole sample period (in (a)) and also over the
period January 1990 to December 1999 (in (b)). These graphs have time on the
horizontal axis and the values of the residuals on the vertical axis. Exhibit
5.28 (c) is a scatter plot of the residuals against their lagged values, that
is, the points in this plot are given by $(e_{i-1}, e_i)$. The residuals of
consecutive months are positively correlated with sample correlation
coefficient $r = 0.28$. In 60 per cent of the months the residual $e_i$ has
the same sign as the residual $e_{i-1}$ in the previous month.

Exhibit 5.28 Interest and Bond Rates (Example 5.19)

Residuals of regression of changes in AAA bond rates on changes in Treasury Bill rates over the
period 1950.01-1999.12 (a), same plot over subsample 1990.01-1999.12 (b), and scatter plot
of residuals against their one-month lagged value (c).

(ii) Economic interpretation

These results indicate that the series of disturbances $\varepsilon_i$ may be
positively correlated over time. Suppose that in some month the change of the
AAA bond rate is larger than would be predicted from the change of the
Treasury Bill rate in that month, so that
$\varepsilon_{i-1} = y_{i-1} - \alpha - \beta x_{i-1} > 0$. If
$\varepsilon_i$ and $\varepsilon_{i-1}$ are positively correlated, then we
expect that $\varepsilon_i > 0$, that is, that in the next month the change
of the AAA rate is again larger than usual. This may be caused by the fact
that deviations from an equilibrium relation between the two rates are not
corrected within a single month but that this adjustment takes a longer
period of time. Such dynamic adjustments require a different model from the
above (static) regression model.
Example 5.20: Food Expenditure (continued)

The investigation of serial correlation for cross section data makes sense
only if the observations can be ordered in some meaningful way. We will
illustrate this by considering a cross section of budget data on food
expenditure for a number of households. These data were earlier discussed in
Example 4.3 (p. 204). We will discuss (i) the data, (ii) a meaningful
ordering of the data, and (iii) the interpretation of serial correlation for
these cross section data.

(i) The data

The budget study of Example 4.3 consists of a cross section of 12,488
households that are aggregated in fifty-four groups. Exhibit 5.29 (a) shows
a histogram of the group sizes. In all that follows we will delete the six
groups with size smaller than twenty. This leaves $n = 48$ group observations
for our analysis (see Exhibit 5.29 (b)). For each group the following data
are available: the fraction of expenditure spent on food ($y$), the total
consumption expenditure ($x_2$, in units of 10,000 dollars per year), and the
average household size ($x_3$). We consider the following linear regression
model:

$$y_i = \beta_1 + \beta_2 x_{2i} + \beta_3 x_{3i} + \varepsilon_i, \quad i = 1, \ldots, 48.$$
(ii) A meaningful ordering of the data

The OLS estimates of $\beta_1$, $\beta_2$, and $\beta_3$ and of the variance
$\sigma^2 = E[\varepsilon_i^2]$ do not depend on the ordering of the groups.
Exhibit 5.29 (c) shows the scatter diagram of the residuals against their
lagged values, that is, the scatter of points $(e_{i-1}, e_i)$, for a
randomly chosen order of the groups. The sample correlation between $e_i$ and
$e_{i-1}$ is very small: $r = -0.012$. Of course, it does not make much sense
to compare the residual of one observation with the residual of the previous
observation in such a randomly ordered sample.

To obtain a meaningful ordering we first order the data in six segments.
Each segment consists of group observations with comparable household
size, with $1 \le x_3 < 2$ in the first segment to $5 \le x_3 < 6$ in the
fifth segment, and with $x_3 \ge 6$ in the last segment. The number of
observations in the six segments is respectively 6, 9, 9, 8, 8, and 8. Within
each segment, that is, for 'fixed' household size, the observations are
ordered according to the total consumption expenditure. This ordering is
indicated in Exhibit 5.29 (d) and (f). With this ordering, we make a scatter
diagram of the residuals $e_i$ against the previous residuals $e_{i-1}$
within segments, that is, for $i$ taking the values 2-6, 8-15, 17-24, 26-32,
34-40, and 42-48 so that residuals are compared only within the same segment
and not between different segments. The scatter diagram in (e) shows a
positive correlation between $e_{i-1}$ and $e_i$, and the sample correlation
coefficient is $r = 0.43$ in this case. This indicates that the series of
error terms $\varepsilon_i$ may be positively correlated within each
segment.
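The within-segment comparison described above can be sketched in Python as follows. The budget data are not reproduced here, so the residuals below are purely hypothetical: a U-shaped pattern inside each segment, of the kind that arises when a straight line is fitted to a non-linear relation. The function names are chosen for illustration, and the correlation is the ordinary Pearson coefficient of the pairs.

```python
from math import sqrt

def within_segment_pairs(residuals, segment_sizes):
    """Pairs (e_{i-1}, e_i) formed only inside each segment of the ordering."""
    pairs, start = [], 0
    for size in segment_sizes:
        segment = residuals[start:start + size]
        pairs.extend(zip(segment[:-1], segment[1:]))
        start += size
    return pairs

def correlation(pairs):
    """Pearson correlation between the two coordinates of the pairs."""
    m = len(pairs)
    mean_prev = sum(p for p, _ in pairs) / m
    mean_cur = sum(c for _, c in pairs) / m
    cov = sum((p - mean_prev) * (c - mean_cur) for p, c in pairs)
    var_prev = sum((p - mean_prev) ** 2 for p, _ in pairs)
    var_cur = sum((c - mean_cur) ** 2 for _, c in pairs)
    return cov / sqrt(var_prev * var_cur)

# Segment sizes as in the text: 48 observations, 48 - 6 = 42 usable pairs.
segment_sizes = [6, 9, 9, 8, 8, 8]

# Hypothetical residuals with a U-shape inside every segment.
residuals = []
for size in segment_sizes:
    centre = (size - 1) / 2
    residuals.extend((j - centre) ** 2 for j in range(size))

pairs = within_segment_pairs(residuals, segment_sizes)
r = correlation(pairs)
print(len(pairs), round(r, 2))
```

Dropping the first observation of every segment is what prevents the last residual of one household-size segment from being paired with the first residual of the next.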

(iii)  Interpretation  of  serial  correlation  for  cross  section  data 

To obtain a better understanding, Exhibits 5.29 (g) and (h) show the actual
values of $y_i$ and the fitted values
$\hat{y}_i = b_1 + b_2 x_{2i} + b_3 x_{3i}$ for the third segment of
households (where $x_{3i} = 3.1$ for each observation), together with the
residuals $e_i = y_i - \hat{y}_i$. Whereas the fitted relation is linear, the
observed data indicate a non-linear relation with diminishing slope for
higher levels of total expenditure. As a consequence, residuals tend to be
positive for relatively small and for relatively large values of total
expenditure and they tend to be negative for average values of total
expenditure. Neighbouring residuals in this ordering therefore tend to have
the same sign, so that the residuals are serially correlated. These results
are in line with the earlier discussion in Example 4.3 (pp. 204-5), as the
effect of income on food expenditure declines for higher income levels. That
is, the relation between $x_2$ and $y$ is non-linear, so that the linear
regression model is misspecified. The serial correlation of the ordered data
provides a diagnostic indication of this misspecification.
[Exhibit 5.29, surviving panel labels: (f) TOTCONS against GROUP; (g) and (h) horizontal axis TOTCONS]

Exhibit  5.29  Food  Expenditure  (Example  5.20) 


(a)  and  (b)  show  histograms  of  the  group  sizes  ((a)  for  all  54  groups,  (b)  for  the  48  groups  with 
size  >  20).  (c)  and  (e)  show  scatter  diagrams  of  the  OLS  residuals  (RESOLS)  against  their 
lagged  values  (RESOLSLAG)  for  random  ordering  ((c),  r=  -0.012)  and  for  systematic 
ordering  ((e),  r  =  0.43).  The  systematic  ordering  is  in  six  segments  according  to  household 
size  (d),  and  the  ordering  within  each  segment  is  by  total  consumption  (f).  (g)  shows  the  actual 
and  fitted  values  in  the  third  segment  (groups  16-24,  average  household  size  3.1)  with 
corresponding  residuals  in  (h). 


Exercises: E: 5.27a, b.


5.5.2  Properties  of  OLS 

Consequences  of  serial  correlation 

Serial correlation is often a sign that the model should be adjusted. If one
sees no possibilities to adjust the model to remove the serial correlation,
then one can still apply OLS. Serial correlation corresponds to the case
where the covariance matrix $\Omega = E[\varepsilon\varepsilon']$ of the
disturbances is not diagonal, and we assume that $\Omega$ is unknown. As was
discussed in Section 5.4.2, under the Assumptions 1, 2, 5, and 6, OLS remains
unbiased and is consistent under appropriate conditions. In this sense OLS is
still an acceptable method of estimation. However, the OLS estimator is not
efficient, and its covariance matrix is not equal to $\sigma^2(X'X)^{-1}$ but
it depends on the (unknown) covariance matrix $\Omega$ (see (5.23)). In many
cases, the OLS expressions underestimate the standard errors of the
regression coefficients and therefore t- and F-tests tend to exaggerate the
significance of these coefficients (see Exercise 5.22 for an illustration).


Derivation  of  GMM  standard  errors 

Consistent estimates of the standard errors can be obtained by GMM. That is,
OLS can be expressed in terms of the $k$ moment conditions

$$E[g_i] = 0, \quad g_i = \varepsilon_i x_i = (y_i - x_i'\beta)x_i, \quad i = 1, \ldots, n.$$

Note that the situation differs from the one considered in Section 5.4.2, as
there the functions $g_i$ are mutually independent, but this does not hold
true if the $\varepsilon_i$ are serially correlated. To describe the required
modifications, we use the result (5.23) so that the variance of $b$ is given
by

$$\mathrm{var}(b) = \frac{1}{n}\left(\frac{1}{n}X'X\right)^{-1}\left(\frac{1}{n}X'\Omega X\right)\left(\frac{1}{n}X'X\right)^{-1}.$$

Let $\sigma_{ij}$ denote the $(i,j)$th element of $\Omega$; then
$\sigma_{ij} = \sigma_{ji}$ (as $\Omega$ is symmetric) and

$$\frac{1}{n}X'\Omega X = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{n}\sigma_{ij}x_ix_j' = \frac{1}{n}\sum_{i=1}^{n}\sigma_{ii}x_ix_i' + \frac{1}{n}\sum_{i=1}^{n-1}\sum_{j=i+1}^{n}\sigma_{ij}(x_ix_j' + x_jx_i').$$

In the White correction for standard errors, the unknown variances
$\sigma_i^2$ in (5.24) are replaced by the squared residuals $e_i^2$ in
(5.25). If we copy this idea for the current situation, then the variance
would be estimated simply by replacing
$\sigma_{ij} = E[\varepsilon_i\varepsilon_j]$ by the product $e_ie_j$ of the
corresponding residuals. However, the resulting estimate of the covariance
matrix of $b$ is useless because

$$\frac{1}{n}\sum_{i=1}^{n}e_i^2x_ix_i' + \frac{1}{n}\sum_{i=1}^{n-1}\sum_{j=i+1}^{n}e_ie_j(x_ix_j' + x_jx_i') = \frac{1}{n}X'ee'X = 0,$$

as the OLS residuals satisfy $X'e = 0$. A consistent estimator of the
variance of $b$ can be obtained by weighting the contributions of the terms
$e_ie_j$ to give the estimate

$$\hat{V} = \frac{1}{n}\sum_{i=1}^{n}e_i^2x_ix_i' + \frac{1}{n}\sum_{i=1}^{n-1}\sum_{j=i+1}^{n}w_{j-i}e_ie_j(x_ix_j' + x_jx_i'). \tag{5.42}$$

The terms on the diagonal (with $i = j$) have weight 1, and the terms with
$i \neq j$ are given weights with $0 \le w_{j-i} \le 1$. The weighting
function $w$ is also called the kernel. For example, the Bartlett kernel has
weights $w_h = 1 - \frac{h}{B}$ for $h < B$ and $w_h = 0$ for $h \ge B$. To
get consistent estimates, the bandwidth $B$ should depend on the sample size
$n$ in such a way that $B \to \infty$ for $n \to \infty$, but at the same
time $B$ should be sufficiently small so that the double summation in (5.42)
converges. Rules that are applied in practice are to take
$B \approx n^{1/3}$ or, in large samples, $B \approx n^{1/5}$.


Newey-West  standard  errors 

The above method with weighting kernels is due to Newey and West. The
corresponding estimates of the standard errors of the OLS estimator $b$ are
called HAC, that is, they are heteroskedasticity and autocorrelation
consistent. So the Newey-West standard errors of $b$ are given by the square
roots of the diagonal elements of the matrix

$$\widehat{\mathrm{var}}(b) = \frac{1}{n}\left(\frac{1}{n}X'X\right)^{-1}\hat{V}\left(\frac{1}{n}X'X\right)^{-1}$$

with the matrix $\hat{V}$ as defined in (5.42).
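A minimal NumPy sketch of this computation is given below, assuming Bartlett weights $w_h = 1 - h/B$. The function name `newey_west_cov` and the simulated data are illustrative assumptions and are not the bond rate data; software packages may define the lag truncation differently from the bandwidth $B$ used here, so the sketch is not meant to reproduce any particular package output.

```python
import numpy as np

def newey_west_cov(X, e, B):
    """HAC covariance estimate of the OLS estimator b, following (5.42)
    with Bartlett weights w_h = 1 - h/B for h < B and w_h = 0 otherwise."""
    n = X.shape[0]
    G = X * e[:, None]                 # row i holds e_i * x_i'
    V = G.T @ G / n                    # diagonal terms (i = j), weight 1
    for h in range(1, min(B, n)):
        w = 1.0 - h / B
        # (1/n) * sum over i of e_i e_{i-h} x_i x_{i-h}'
        gamma_h = G[h:].T @ G[:-h] / n
        V += w * (gamma_h + gamma_h.T)
    Sxx_inv = np.linalg.inv(X.T @ X / n)
    return Sxx_inv @ V @ Sxx_inv / n

# Simulated regression with AR(1) disturbances (hypothetical data).
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
eta = rng.normal(size=n)
eps = np.empty(n)
eps[0] = eta[0]
for i in range(1, n):
    eps[i] = 0.5 * eps[i - 1] + eta[i]

X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + eps
b = np.linalg.solve(X.T @ X, X.T @ y)  # OLS estimates
e = y - X @ b                          # OLS residuals

B = round(n ** (1 / 3))                # rule of thumb B = n^(1/3)
hac_se = np.sqrt(np.diag(newey_west_cov(X, e, B)))
print(hac_se)
```

With $B = 1$ all off-diagonal terms get weight zero and the estimate reduces to the White covariance matrix, which gives a simple check on an implementation.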


Example 5.21: Interest and Bond Rates (continued)

We continue our analysis of the interest and bond rate data (see Example
5.19). Exhibit 5.30 shows the result of regressing the changes in AAA bond
rates on the changes in the Treasury Bill rates. The HAC standard errors in
Panel 2 are larger than the standard errors computed by the conventional
OLS formulas in Panel 1. However, these differences do not affect the
significance of the relationship. The residual correlation is relatively mild
($r = 0.28$ (see Exhibit 5.28 (c))). In situations with more substantial
serial correlation the differences may be much more dramatic (see Exercise
5.22 for an illustration).


Panel 1: Dependent Variable: DAAA
Method: Least Squares
Sample: 1950:01 1999:12
Included observations: 600

Variable        Coefficient   Std. Error   t-Statistic   Prob.
C                  0.006393     0.006982      0.915697   0.3602
DUS3MT             0.274585     0.014641      18.75442   0.0000

R-squared            0.370346

Panel 2: Dependent Variable: DAAA
Method: Least Squares
Sample: 1950:01 1999:12
Included observations: 600
Newey-West HAC Standard Errors & Covariance (lag truncation=5)

Variable        Coefficient   Std. Error   t-Statistic   Prob.
C                  0.006393     0.008309      0.769436   0.4419
DUS3MT             0.274585     0.021187      12.95993   0.0000

R-squared            0.370346

Exhibit 5.30 Interest and Bond Rates (Example 5.21)

Regression of changes in AAA bond rates on changes in Treasury Bill rates with conventional
OLS standard errors (Panel 1) and with Newey-West standard errors (Panel 2).

Exercises:  S:  5.22. 


5.5.3  Tests  for  serial  correlation 

Autocorrelation  coefficients 

Serial correlation tests require that the observations can be ordered in
a natural way. For time series data, where the variables are observed
sequentially over time, such a natural ordering is given by the time index
$i$. For cross section data the observations can be ordered according to one
of the explanatory variables. In the foregoing sections we considered the
correlation between consecutive residuals. The sample correlation coefficient
of the residuals is defined by

$$r = \frac{\sum_{i=2}^{n}e_ie_{i-1}}{\sqrt{\sum_{i=2}^{n}e_i^2\sum_{i=2}^{n}e_{i-1}^2}}.$$

In practice one often considers a slightly different (but asymptotically
equivalent) expression, the first order autocorrelation coefficient defined
by

$$r_1 = \frac{\sum_{i=2}^{n}e_ie_{i-1}}{\sum_{i=1}^{n}e_i^2}. \tag{5.43}$$

Large values of $r_1$ may be an indication of dynamic misspecification (for
time series data) or of functional misspecification (for cross section data).
To consider the possibility of more general forms of misspecification, it is
informative to consider also the $k$th order autocorrelation coefficients

$$r_k = \frac{\sum_{i=k+1}^{n}e_ie_{i-k}}{\sum_{i=1}^{n}e_i^2}. \tag{5.44}$$

This measures the correlation between residuals that are $k$ observations
apart. A plot of the autocorrelations $r_k$ against the lag $k$ is called the
correlogram. This plot provides a first idea of possible serial correlation.
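The autocorrelation coefficients in (5.43) and (5.44) are simple to compute. The sketch below uses plain Python and a small made-up residual series; the function names are chosen for illustration only.

```python
def autocorrelation(e, k):
    """k-th order autocorrelation r_k of the residuals e, as in (5.44)."""
    numerator = sum(e[i] * e[i - k] for i in range(k, len(e)))
    denominator = sum(v * v for v in e)
    return numerator / denominator

def correlogram(e, max_lag):
    """Autocorrelations r_1, ..., r_max_lag, to be plotted against the lag."""
    return [autocorrelation(e, k) for k in range(1, max_lag + 1)]

# Made-up residuals that alternate in sign (strong negative correlation).
e = [1.0, -1.0, 1.0, -1.0]
print(correlogram(e, 3))  # [-0.75, 0.5, -0.25]
```

Because the numerator of $r_k$ contains only $n - k$ terms while the denominator contains $n$, the coefficients shrink towards zero at long lags even for a perfectly alternating series.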
We will now discuss three tests for serial correlation, Durbin-Watson,
Breusch-Godfrey, and Ljung-Box.

The  Durbin-Watson  test 

The Durbin-Watson test is based on the following idea. Let $\sigma^2$ be the
variance of the disturbances and let $\rho$ be the correlation between
$\varepsilon_i$ and $\varepsilon_{i-1}$; then
$E[(\varepsilon_i - \varepsilon_{i-1})^2] = 2\sigma^2(1 - \rho)$. So if
successive error terms are positively (negatively) correlated, then the
differences $\varepsilon_i - \varepsilon_{i-1}$ tend to be relatively small
(large). The Durbin-Watson statistic is defined as

$$d = \frac{\sum_{i=2}^{n}(e_i - e_{i-1})^2}{\sum_{i=1}^{n}e_i^2}.$$

This statistic satisfies $0 \le d \le 4$, and

$$d \approx 2(1 - r_1)$$

with $r_1$ the first order autocorrelation coefficient defined in (5.43). In
the absence of first order serial correlation $r_1 \approx 0$ so that
$d \approx 2$. Values of $d$ close to zero indicate positive serial
correlation, and values close to four indicate negative serial correlation.
Critical values to test the null hypothesis $\rho = 0$ depend on the matrix
$X$ of explanatory variables. However, lower and upper bounds for the
critical values that do not depend on $X$ have been calculated by Durbin and
Watson. The use of these bounds requires that the model contains a constant
term, that the disturbances are normally distributed, and that the regressors
are non-stochastic; for instance, lagged values of the dependent variable
$y_i$ are not allowed. The Durbin-Watson test is nowadays mostly used
informally as a diagnostic tool to indicate the possible existence
of serial correlation.
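The relation between $d$ and $r_1$ can be made exact: expanding the squares in the numerator gives $d = 2(1 - r_1) - (e_1^2 + e_n^2)/\sum_{i=1}^{n} e_i^2$, so the approximation only neglects the two end terms. The Python sketch below checks this on a made-up residual series; the function names and the data are illustrative assumptions.

```python
def first_order_autocorrelation(e):
    """r_1 as defined in (5.43)."""
    return sum(e[i] * e[i - 1] for i in range(1, len(e))) / sum(v * v for v in e)

def durbin_watson(e):
    """Durbin-Watson statistic d for the residual series e."""
    numerator = sum((e[i] - e[i - 1]) ** 2 for i in range(1, len(e)))
    return numerator / sum(v * v for v in e)

# Made-up residuals that change sign slowly (positive serial correlation).
e = [0.5, 0.8, 0.3, -0.2, -0.6, -0.1, 0.4]

d = durbin_watson(e)
r1 = first_order_autocorrelation(e)
end_terms = (e[0] ** 2 + e[-1] ** 2) / sum(v * v for v in e)

print(f"d = {d:.3f}, 2(1 - r1) = {2 * (1 - r1):.3f}, end terms = {end_terms:.3f}")
```

In long series the end terms are negligible. For instance, a first order correlation of about 0.28, as found for the bond rate residuals in Example 5.19, would correspond to a value of $d$ of roughly 1.44.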

□  Derivation  of  the  Breusch-Godfrey  DW-test 

The  Breusch-Godfrey  test  is  an  LM-test  on  serial  correlation.  The  model  is 
given  by 


$$y_i = x_i'\beta + \varepsilon_i \tag{5.45}$$

$$\varepsilon_i = \gamma_1\varepsilon_{i-1} + \cdots + \gamma_p\varepsilon_{i-p} + \eta_i \tag{5.46}$$

where $\eta_i$ satisfies Assumptions 2-4 and 7. That is, the $n \times 1$ vector $\eta$ is distributed as $N(0, \sigma_\eta^2 I)$ so that the $\eta_i$ are homoskedastic and serially uncorrelated. The equation (5.46) is called an autoregressive model of order $p$ (written as AR($p$)) for the error terms. For simplicity we consider in our analysis below only the case of an AR(1) model for the error terms. In this case we have

$$\varepsilon_i = \gamma\varepsilon_{i-1} + \eta_i, \tag{5.47}$$

and we assume that $-1 < \gamma < 1$. By repeated substitution we get

$$\varepsilon_i = \eta_i + \gamma\eta_{i-1} + \gamma^2\eta_{i-2} + \cdots + \gamma^{i-2}\eta_2 + \gamma^{i-1}\varepsilon_1.$$

So  the  error  term  for  observation  i  is  composed  of  independent  terms  with  weights 
that  decrease  geometrically.  The  absence  of  serial  correlation  corresponds  to  the 
null  hypothesis  that 


$$H_0: \gamma = 0.$$

We derive the LM-test for this hypothesis by using the results in Section 4.2.4 (p. 217-8) for non-linear regression models (the ML approach of Section 4.3.6 is left as an exercise (see Exercise 5.8)). As a first step in the derivation of the LM-test we use (5.45) and (5.47) to obtain

$$y_i = \gamma y_{i-1} + x_i'\beta - \gamma x_{i-1}'\beta + \eta_i. \tag{5.48}$$

This model contains $2k + 1$ regressors but only $k + 1$ parameters; that is, the parameters satisfy $k$ (non-linear) restrictions. The model (5.48) can be written as a non-linear regression model

$$y_i = f(z_i, \beta, \gamma) + \eta_i,$$

where $z_i = (y_{i-1}, x_i', x_{i-1}')'$. According to the results for non-linear regression models in Section 4.2.4, the LM-test can be computed by auxiliary regressions provided that the regressors $z_i$ satisfy the two conditions that $\mathrm{plim}(\frac{1}{n}\sum z_i z_i') = Q_z$ exists (and is non-singular) and that $\mathrm{plim}(\frac{1}{n}\sum z_i\eta_i) = 0$ (orthogonality). Under the null hypothesis that $\gamma = 0$ in (5.47), we get $\varepsilon_i = \eta_i$ and (5.45) shows that $z_i$ is a linear function of $(\eta_{i-1}, x_i', x_{i-1}')'$. The above two limit conditions are satisfied if
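$$\mathrm{plim}\Big(\frac{1}{n}\sum x_i\eta_i\Big) = \mathrm{plim}\Big(\frac{1}{n}\sum x_i\eta_{i-1}\Big) = \mathrm{plim}\Big(\frac{1}{n}\sum x_{i-1}\eta_i\Big) = 0$$

and

$$\mathrm{plim}\,\frac{1}{n}\begin{pmatrix} \sum x_i x_i' & \sum x_i x_{i-1}' \\ \sum x_{i-1} x_i' & \sum x_{i-1} x_{i-1}' \end{pmatrix} = Q$$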


with  Q  a  non-singular  (2k)  x  (2k)  matrix.  According  to  the  results  in  Section 
4.2.4,  the  LM-test  for  y  =  0  can  then  be  computed  as 


$$LM = nR^2.$$

Here $R^2$ is obtained from the regression of the OLS residuals $e = y - Xb$ on the gradient of the function $f$, that is, the vector of first order derivatives $\partial f/\partial\beta$ and $\partial f/\partial\gamma$ evaluated in the point $(\beta, \gamma) = (b, 0)$. The model (5.48) gives, when evaluated at $\beta = b$ and $\gamma = 0$,

$$\partial f/\partial\beta = x_i - \gamma x_{i-1} = x_i,$$

$$\partial f/\partial\gamma = y_{i-1} - x_{i-1}'\beta = y_{i-1} - x_{i-1}'b = e_{i-1}.$$

The  Breusch-Godfrey  test 

The foregoing arguments show that the Breusch-Godfrey test is obtained as $LM = nR^2$ of the auxiliary regression

$$e_i = x_i'\delta + \gamma e_{i-1} + \omega_i, \quad i = 2, \ldots, n. \tag{5.49}$$

This test has an asymptotic $\chi^2(1)$ distribution under the null hypothesis
of  absence  of  serial  correlation.  The  LM-test  for  the  null  hypothesis  of 
absence  of  serial  correlation  against  the  alternative  of  AR(p)  errors  in 
(5.46)  can  be  derived  in  a  similar  way.  This  leads  to  the  following  test 
procedure. 


Breusch-Godfrey  test  for  serial  correlation  of  order  p 

• Step 1: Apply OLS. Apply OLS in the model $y = X\beta + \varepsilon$ and compute the residuals $e = y - Xb$.

• Step 2: Perform auxiliary regression. Apply OLS in the auxiliary regression equation

$$e_i = x_i'\delta + \gamma_1 e_{i-1} + \cdots + \gamma_p e_{i-p} + \omega_i, \quad i = p + 1, \ldots, n.$$

• Step 3: $LM = nR^2$ of the regression in step 2. Then $LM = nR^2$ where $R^2$ is the coefficient of determination of the auxiliary regression in step 2. This is asymptotically distributed as $\chi^2(p)$ under the null hypothesis of no serial correlation, that is, if $\gamma_1 = \cdots = \gamma_p = 0$.
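The three steps can be sketched in a few lines of code. This is an illustrative sketch under stated assumptions, not the book's software: the helper names `ols`, `breusch_godfrey_lm`, and `simulate` are hypothetical, the data are simulated, and $n$ in $LM = nR^2$ is taken as the number of observations used in the auxiliary regression (software packages may treat the first $p$ residuals differently).

```python
import numpy as np

def ols(y, X):
    """Return OLS residuals and the centered R-squared of regressing y on X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    r2 = 1.0 - np.sum(e ** 2) / np.sum((y - y.mean()) ** 2)
    return e, r2

def breusch_godfrey_lm(y, X, p=1):
    """LM = n * R^2 of the auxiliary regression; X must contain a constant column."""
    e, _ = ols(y, X)                                   # step 1: OLS residuals
    n = len(e)
    # step 2: regress e_i on x_i and e_{i-1}, ..., e_{i-p} for i = p+1, ..., n
    lags = np.column_stack([e[p - k:n - k] for k in range(1, p + 1)])
    _, r2 = ols(e[p:], np.column_stack([X[p:], lags]))
    return (n - p) * r2                                # step 3: LM statistic

def simulate(gamma, n=400, seed=1):
    """Hypothetical data: y = 1 + 2x + eps with AR(1) errors eps."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    eta = rng.standard_normal(n)
    eps = np.empty(n)
    eps[0] = eta[0]
    for i in range(1, n):
        eps[i] = gamma * eps[i - 1] + eta[i]
    return 1.0 + 2.0 * x + eps, np.column_stack([np.ones(n), x])

lm_ar = breusch_godfrey_lm(*simulate(gamma=0.6), p=1)   # serially correlated errors
lm_iid = breusch_godfrey_lm(*simulate(gamma=0.0), p=1)  # uncorrelated errors
print(lm_ar, lm_iid)
```

For $p = 1$ the statistic is compared with the 5 per cent critical value 3.84 of the $\chi^2(1)$ distribution.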


An asymptotically equivalent test is given by the usual F-test on the joint
significance of the parameters $(\gamma_1, \ldots, \gamma_p)$ in the above auxiliary regression.
To choose the value of $p$ in the Breusch-Godfrey test, it may be helpful to
draw the correlogram of the residuals $e_i$. In practice one usually selects small
values  for  p  (p  =  1  or  p  =  2)  and  includes  selective  additional  lags  according 
to  the  data  structure.  For  instance,  if  the  data  consist  of  time  series  that  are 
observed  every  month,  then  one  can  include  all  lags  up  to  order  twelve  to 
incorporate  monthly  effects. 

Box-Pierce  and  Ljung-Box  tests 

As a third test for serial correlation we consider the Box-Pierce test for the joint significance of the first $p$ autocorrelation coefficients defined in (5.44). The test statistic is given by

$$BP = n\sum_{k=1}^{p} r_k^2. \tag{5.50}$$

It is left as an exercise (see Exercise 5.9) to show that this test is asymptotically equivalent to the Breusch-Godfrey test. In particular, $BP \approx \chi^2(p)$ under the null hypothesis of no serial correlation.

Sometimes the correlations in (5.50) are weighted because higher order autocorrelations are based on fewer observations; that is, $r_k$ in (5.44) is based on $(n - k)$ products of residuals $e_i e_{i-k}$. This gives the Ljung-Box test (also denoted as the Q-test)

$$LB = n\sum_{k=1}^{p}\frac{n + 2}{n - k}\,r_k^2 \approx \chi^2(p).$$


Similar to the Durbin-Watson test, the Box-Pierce test and the Ljung-Box test also require that the regressors $x_i$ in the model (5.45) are non-stochastic. Otherwise it is better to apply the Breusch-Godfrey LM-test.
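Both statistics are simple functions of the residual autocorrelations. The sketch below is illustrative only; the function names are hypothetical, and the form $r_k = \sum_{i=k+1}^n e_i e_{i-k} / \sum_{i=1}^n e_i^2$ is an assumption about (5.44), which is not reproduced in this section. As a check, the rounded autocorrelations 0.276 and -0.076 with $n = 600$ from Panel 2 of Exhibit 5.31 are used as inputs.

```python
import numpy as np

def autocorrelations(e, p):
    """r_k for k = 1..p, with r_k = sum e_i e_{i-k} / sum e_i^2 (assumed form of (5.44))."""
    e = np.asarray(e, dtype=float)
    denom = np.sum(e ** 2)
    return np.array([np.sum(e[k:] * e[:-k]) / denom for k in range(1, p + 1)])

def box_pierce(r, n):
    """BP = n * sum_k r_k^2."""
    r = np.asarray(r, dtype=float)
    return n * np.sum(r ** 2)

def ljung_box(r, n):
    """LB = n * sum_k (n + 2) / (n - k) * r_k^2."""
    r = np.asarray(r, dtype=float)
    k = np.arange(1, len(r) + 1)
    return n * np.sum((n + 2) / (n - k) * r ** 2)

# Rounded autocorrelations at lags 1 and 2 from Exhibit 5.31, Panel 2 (n = 600).
r = [0.276, -0.076]
print(ljung_box(r[:1], 600))   # close to the reported Q-Stat 45.932
print(ljung_box(r, 600))       # close to the reported Q-Stat 49.398
print(box_pierce(r, 600))      # slightly smaller than LB, as the weights exceed one
```

The small differences from the reported Q-statistics are due to the three-decimal rounding of the autocorrelations.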


Example  5.22:  Interest  and  Bond  Rates  (continued) 

We  perform  serial  correlation  tests  for  the  interest  and  bond  rate  data 
discussed  before  in  Examples  5.19  and  5.21.  Exhibit  5.31,  Panel  1,  shows 
the  results  of  regressing  the  changes  in  the  AAA  bond  rate  on  the  changes 
in the Treasury Bill rate. The Durbin-Watson statistic is equal to $d = 1.447$, so that the first order autocorrelation coefficient is $r_1 \approx 1 - \frac{1}{2}d = 0.277$. Exhibit 5.31, Panel 2, contains the first twelve autocorrelation coefficients of the residuals. The first order autocorrelation coefficient is significant. The Q-test in Panel 2 corresponds to the Ljung-Box test (with $p$ ranging from $p = 1$ to $p = 12$). Panels 3 and 4 show the results of the Breusch-Godfrey test with one or two lags of the residuals. All tests lead to a clear rejection of the null hypothesis of no serial correlation.

Example 5.23: Food Expenditure (continued)

Next  we  perform  tests  on  serial  correlation  for  the  data  on  food  expenditure 
discussed  before  in  Example  5.20.  Exhibit  5.32  shows  the  results  of  different 
tests  for  serial  correlation  for  the  budget  data  of  forty-eight  groups  of 
households.  For  a  randomly  chosen  ordering  of  the  groups,  the  correlogram, 
the  Ljung-Box  test,  and  the  Breusch-Godfrey  test  indicate  that  there  is  no 
serial  correlation  (see  Panels  1  and  2). 

Now  we  consider  a  meaningful  ordering  of  the  groups  in  six  segments,  as 
discussed  in  Example  5.20.  Each  segment  consists  of  groups  of  households  of 


Panel 1: Dependent Variable: DAAA
Method: Least Squares
Sample: 1950:01 1999:12
Included observations: 600

Variable     Coefficient   Std. Error   t-Statistic   Prob.
C               0.006393     0.006982      0.915697   0.3602
DUS3MT          0.274585     0.014641     18.75442    0.0000

R-squared            0.370346
Durbin-Watson stat   1.446887

Panel 2: Correlogram of residuals

Lag       AC    Q-Stat    Prob
 1     0.276    45.932   0.000
 2    -0.076    49.398   0.000
 3     0.008    49.441   0.000
 4     0.034    50.126   0.000
 5     0.055    51.939   0.000
 6     0.101    58.189   0.000
 7     0.035    58.934   0.000
 8     0.049    60.412   0.000
 9     0.044    61.610   0.000
10     0.008    61.646   0.000
11     0.032    62.289   0.000
12    -0.062    64.624   0.000

Panel 3: Breusch-Godfrey Serial Correlation LM Test:

F-statistic      51.91631   Probability   0.000000
Obs*R-squared    48.00277   Probability   0.000000

Dependent Variable: RESID

Variable     Coefficient   Std. Error   t-Statistic   Prob.
C               0.000222     0.006702      0.033123   0.9736
DUS3MT         -0.022449     0.014396     -1.559364   0.1194
RESID(-1)       0.289879     0.040231      7.205297   0.0000

R-squared   0.080005

Panel 4: Breusch-Godfrey Serial Correlation LM Test:

F-statistic      36.01114   Probability   0.000000
Obs*R-squared    64.68852   Probability   0.000000

Dependent Variable: RESID

Variable     Coefficient   Std. Error   t-Statistic   Prob.
C               0.000311     0.006606      0.047153   0.9624
DUS3MT         -0.029051     0.014271     -2.035626   0.0422
RESID(-1)       0.342590     0.041495      8.256187   0.0000
RESID(-2)      -0.175063     0.040616     -4.310152   0.0000

R-squared   0.107814

Exhibit 5.31 Interest and Bond Rates (Example 5.22)

Regression of AAA bond rates (Panel 1) with correlogram of residuals (Panel 2, ‘Q-Stat’ is the Ljung-Box test) and Breusch-Godfrey tests on serial correlation (with order $p = 1$ in Panel 3 and $p = 2$ in Panel 4; these panels also show the auxiliary regression of step 2 of this LM-test).

Panel 1: Correlogram RESRAND
(randomly ordered data, 48 observations)

Lag       AC    Q-Stat    Prob
 1    -0.096    0.4697   0.493
 2     0.093    0.9237   0.630
 3     0.066    1.1560   0.764
 4     0.012    1.1643   0.884
 5     0.031    1.2165   0.943
 6    -0.134    2.2497   0.895
 7     0.207    4.7566   0.690
 8    -0.256    8.6744   0.370
 9     0.183    10.729   0.295
10    -0.019    10.752   0.377

Panel 2: Breusch-Godfrey test, Dependent Variable: RESRAND
Sample(adjusted): 2 48 (included observations: 47)

Variable       Coefficient   Std. Error   t-Statistic   Prob.
C                -8.66E-05     0.005750     -0.015064   0.9881
TOTCONS          -0.000470     0.008724     -0.053863   0.9573
AHSIZE            9.38E-05     0.001248      0.075099   0.9405
RESRAND(-1)      -0.101757     0.156142     -0.651694   0.5181

R-squared   0.009857

Panel 3: Correlogram RESORD
(systematically ordered data)
Sample: 1-6, 8-15, 17-24, 26-32, 34-40, 42-48 (43 obs)

Lag       AC    Q-Stat    Prob
 1     0.327    4.9124   0.027
 2     0.115    5.5369   0.063
 3    -0.039    5.6113   0.132
 4    -0.340    11.362   0.023
 5    -0.253    14.618   0.012
 6     0.007    14.620   0.023
 7     0.087    15.031   0.036
 8     0.353    21.912   0.005
 9     0.213    24.503   0.004
10     0.076    24.842   0.006

Panel 4: Breusch-Godfrey test, Dependent Variable: RESORD
Sample: 2-6, 8-15, 17-24, 26-32, 34-40, 42-48 (included obs 42)

Variable       Coefficient   Std. Error   t-Statistic   Prob.
C                -0.007112     0.005876     -1.210341   0.2336
TOTCONS           0.016883     0.009605      1.757771   0.0868
AHSIZE           -0.000463     0.001233     -0.375160   0.7096
RESORD(-1)        0.480625     0.168762      2.847947   0.0071

R-squared   0.183146

Exhibit 5.32 Food Expenditure (Example 5.23)

Correlograms of residuals and auxiliary regressions of step 2 of Breusch-Godfrey test for budget data with randomly ordered data (Panels 1 and 2, residuals RESRAND) and with systematically ordered data (Panels 3 and 4, residuals RESORD).

comparable size, and the observations within a segment are ordered according to the total consumption expenditure. The six segments consist of the observations with index 1-6, 7-15, 16-24, 25-32, 33-40, and 41-48. We investigate the presence of first order serial correlation within these segments. At the observations $i = 7, 16, 25, 33$, and $41$, the residuals $e_i$ and $e_{i-1}$ correspond to different segments and the correlations between these residuals are excluded from the analysis. This leaves forty-two pairs of residuals for analysis. The results are in Panels 3 and 4 of Exhibit 5.32. The correlogram, the Ljung-Box test, and the Breusch-Godfrey test (with $LM = nR^2 = 42 \cdot 0.183 = 7.69$ with $P = 0.006$) all reject the absence of serial correlation. This indicates misspecification of the linear model; that is, the fraction of expenditure spent on food depends in a non-linear way on total expenditure (see also Example 5.20 (p. 356-8)).
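The selection of residual pairs within segments can be illustrated as follows. This is a small sketch with a hypothetical helper name, using the segment boundaries stated above and the $R^2$ reported in Panel 4 of Exhibit 5.32; the residuals themselves are not reproduced here.

```python
import numpy as np

# Segment boundaries (1-based observation indices) as given in the text.
segments = [(1, 6), (7, 15), (16, 24), (25, 32), (33, 40), (41, 48)]

def within_segment_pairs(segments):
    """Return 0-based index arrays (current, lagged) for pairs (e_i, e_{i-1})
    where both residuals belong to the same segment."""
    current = []
    for start, end in segments:
        # The first observation of each segment has no lag inside that segment.
        current.extend(range(start + 1, end + 1))
    current = np.array(current) - 1      # convert to 0-based indices
    return current, current - 1

current, lagged = within_segment_pairs(segments)
n_pairs = len(current)                   # number of usable pairs
lm = n_pairs * 0.183146                  # LM = n * R^2 with R^2 from Panel 4
print(n_pairs, round(lm, 2))
```

Given a residual vector `e` of length 48, the auxiliary regression would then use `e[current]` as dependent variable and `e[lagged]` as the lagged residual.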

Exercises:  T:  5.8,  5.9,  5.11;  S:  5.21;  E:  5.30a,  b,  5.31d. 


5.5.4  Model  adjustments 

Regression  models  with  lagged  variables 

If  the  residuals  of  an  estimated  equation  are  serially  correlated,  this  indicates 
that  the  model  is  not  correctly  specified.  For  (ordered)  cross  section  data  this 
may be caused by non-linearities in the functional form, and we refer to
Section  5.2  for  possible  adjustments  of  the  model.  For  time  series  data,  serial 
correlation  means  that  some  of  the  dynamic  properties  of  the  data  are  not 
captured  by  the  model.  In  this  case  one  can  adjust  the  model  —  for  instance, 
by  including  lagged  values  of  the  explanatory  variables  and  of  the  explained 
variable  as  additional  regressors.  As  an  example,  suppose  that  the  model 


$$y_i = \beta_1 + \beta_2 x_i + \varepsilon_i$$

is estimated by OLS and that the residuals are serially correlated. This suggests that $\varepsilon_i = y_i - \beta_1 - \beta_2 x_i$ is correlated with $\varepsilon_{i-1} = y_{i-1} - \beta_1 - \beta_2 x_{i-1}$. This may be caused by correlation of $y_i$ with $y_{i-1}$ and $x_{i-1}$, which can be expressed by the model

$$y_i = \gamma_1 + \gamma_2 x_i + \gamma_3 x_{i-1} + \gamma_4 y_{i-1} + \eta_i. \tag{5.51}$$

When the disturbances $\eta_i$ of this model are identically and independently distributed (IID), then the model is said to have a correct dynamic specification. The search for correct dynamic specifications of time series models is discussed in Chapter 7.

Regression model with autoregressive disturbances

In this section we consider only a special case that is often applied as a first step in modelling serial correlation. Here it is assumed that the dynamics can be modelled by means of the disturbances $\varepsilon_i$, and more in particular that $\varepsilon_i$ satisfies the AR(1) model (5.47) so that $\varepsilon_i = \gamma\varepsilon_{i-1} + \eta_i$. This is called the regression model with AR(1) errors. If one substitutes $\varepsilon_i = y_i - \beta_1 - \beta_2 x_i$ and $\varepsilon_{i-1} = y_{i-1} - \beta_1 - \beta_2 x_{i-1}$, it follows that (5.47) can be written as

$$y_i = \beta_1(1 - \gamma) + \beta_2 x_i - \beta_2\gamma x_{i-1} + \gamma y_{i-1} + \eta_i. \tag{5.52}$$

This is of the form (5.51) with $\gamma_1 = \beta_1(1 - \gamma)$, $\gamma_2 = \beta_2$, $\gamma_3 = -\beta_2\gamma$, and $\gamma_4 = \gamma$, so that the parameters satisfy the restriction

$$\gamma_2\gamma_4 + \gamma_3 = \beta_2\gamma - \beta_2\gamma = 0.$$


Estimation  by  Cochrane-Orcutt 

If the terms $\eta_i$ are IID and normally distributed, then the parameters $\beta_1$, $\beta_2$, and $\gamma$ can be estimated by NLS. An alternative is to use the following iterative two-step method. Note that for a given value of $\gamma$ the parameters $\beta_1$ and $\beta_2$ can be estimated by OLS in

$$y_i - \gamma y_{i-1} = \beta_1(1 - \gamma) + \beta_2(x_i - \gamma x_{i-1}) + \eta_i.$$

On the other hand, if the values of $\beta_1$ and $\beta_2$ are given, then $\varepsilon_i = y_i - \beta_1 - \beta_2 x_i$ can be computed and hence $\gamma$ can be estimated by OLS in

$$\varepsilon_i = \gamma\varepsilon_{i-1} + \eta_i.$$

We can exploit this as follows. As a first step take $\gamma = 0$ and estimate $\beta_1$ and $\beta_2$ by OLS. This estimator is consistent (provided that $-1 < \gamma < 1$ (see Chapter 7)), but it is not efficient. Let $e_i = y_i - b_1 - b_2 x_i$ be the OLS residuals; then in the second step $\gamma$ is estimated by regressing $e_i$ on $e_{i-1}$. This estimator (say $\hat\gamma$) is also consistent. To improve the efficiency we can repeat these two steps. First a new estimate of $\beta_1$ and $\beta_2$ is obtained by regressing $y_i - \hat\gamma y_{i-1}$ on a constant and $x_i - \hat\gamma x_{i-1}$. Second, if $e_i$ are the new residuals, then a new estimate of $\gamma$ is obtained by regressing $e_i$ on $e_{i-1}$. This process is iterated till the estimates of $\beta_1$, $\beta_2$, and $\gamma$ converge. This is called the Cochrane-Orcutt method for the estimation of regression models with AR(1) errors. The estimates converge to a local minimum of the sum-of-squares criterion function, and it may be worthwhile to redo the iterations with different initial values for the parameters $\beta_1$, $\beta_2$, and $\gamma$.
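The iteration described above can be sketched as follows for the simple regression case. This is a minimal illustration under stated assumptions rather than a reference implementation: the function name, the tolerance, and the simulated data are hypothetical, and the first pass (with $\gamma = 0$) uses observations $2, \ldots, n$ so that all passes use the same sample.

```python
import numpy as np

def cochrane_orcutt(y, x, tol=1e-8, max_iter=100):
    """Iterate between OLS for (beta1, beta2) given gamma and OLS for gamma given the betas."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    gamma = 0.0                                      # start from gamma = 0
    for _ in range(max_iter):
        # Given gamma: regress y_i - gamma*y_{i-1} on a constant and x_i - gamma*x_{i-1}.
        ys = y[1:] - gamma * y[:-1]
        xs = x[1:] - gamma * x[:-1]
        Z = np.column_stack([np.ones_like(xs), xs])
        const, b2 = np.linalg.lstsq(Z, ys, rcond=None)[0]
        b1 = const / (1.0 - gamma)                   # the constant equals beta1*(1 - gamma)
        # Given (b1, b2): regress e_i on e_{i-1} without a constant.
        e = y - b1 - b2 * x
        gamma_new = np.sum(e[1:] * e[:-1]) / np.sum(e[:-1] ** 2)
        converged = abs(gamma_new - gamma) < tol
        gamma = gamma_new
        if converged:
            break
    return b1, b2, gamma

# Hypothetical data: y = 1 + 2x + eps, with eps_i = 0.5*eps_{i-1} + eta_i.
rng = np.random.default_rng(0)
n = 2000
x = rng.standard_normal(n)
eta = rng.standard_normal(n)
eps = np.empty(n)
eps[0] = eta[0]
for i in range(1, n):
    eps[i] = 0.5 * eps[i - 1] + eta[i]
y = 1.0 + 2.0 * x + eps

b1, b2, g = cochrane_orcutt(y, x)
print(b1, b2, g)
```

Since convergence is only to a local minimum, one could rerun the function from other starting values of `gamma` (this would need a small extension of the sketch) and compare the resulting sums of squares.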

As  the  regression  model  with  AR(1)  errors  is  a  restriction  of  the 
more  general  model  (5.51),  this  restriction  can  be  tested  in  the  usual 
way  —  for  instance,  by  the  Wald  test.  The  regression  model  with  AR(1) 
errors  has  been  popular  because  it  is  simple  and  because  the  Cochrane- 
Orcutt  estimator  can  be  computed  by  iterated  regressions.  Nowadays  more 
general  dynamic  models  like  (5.51)  are  often  preferred,  as  will  be  discussed  in 
Chapter  7. 

Example  5.24:  Interest  and  Bond  Rates  (continued) 

We  continue  our  analysis  of  the  interest  and  bond  rate  data.  In  Example  5.22 
in  the  previous  section  we  found  clear  evidence  for  the  presence  of  serial 
correlation  for  these  data.  We  now  estimate  the  adjusted  model  (5.51),  with 
the result shown in Panel 2 of Exhibit 5.33. Both lagged terms ($x_{i-1}$ and $y_{i-1}$) are significant, and $\gamma_2\gamma_4 + \gamma_3 = 0.252 \cdot 0.290 - 0.080 = -0.007$ is close to zero. The Wald test on the restriction $\gamma_2\gamma_4 + \gamma_3 = 0$ in Panel 3 has a P-value of $P = 0.64$, so that this restriction is not rejected. The regression model with AR(1) errors is therefore not rejected, and the estimation results of this model are shown in Panel 4 of Exhibit 5.33. To evaluate this last model, Panel 1 contains for comparison the results of OLS. Including AR(1) errors leads to an increase of $R^2$, but this should not be a surprise. The Durbin-Watson statistic is closer to 2 (1.90 as compared with 1.45), but recall that for models with lags this statistic does not provide consistent estimates of the correlation between the residuals (see p. 362 and Exercise 5.11). Panel 5 contains the correlogram of the OLS residuals and of the residuals of the model with AR(1) errors, that is, of (5.52). The residuals of the model (5.52) still contain some significant correlations. Other models are needed
for  these  data,  and  this  will  be  further  discussed  in  Chapter  7. 
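The restriction can be checked directly from the reported estimates. The following sketch simply plugs in the coefficients of Panels 2 and 4 of Exhibit 5.33; the variable names are hypothetical and no new estimation is performed.

```python
# Estimates of the dynamic model (5.51), taken from Panel 2 of Exhibit 5.33.
g1, g2, g3, g4 = 0.004780, 0.252145, -0.079636, 0.289881

# Restriction implied by AR(1) errors: gamma2 * gamma4 + gamma3 = 0.
restriction = g2 * g4 + g3

# If the restriction holds, (5.52) implies these values for the AR(1) model parameters.
gamma_implied = g4
beta2_implied = g2
beta1_implied = g1 / (1.0 - g4)

# Estimates of the regression model with AR(1) errors, from Panel 4 of Exhibit 5.33.
beta1_ar1, beta2_ar1, gamma_ar1 = 0.006668, 0.252361, 0.288629

print(round(restriction, 3))
print(beta1_implied, beta1_ar1)
print(beta2_implied, beta2_ar1)
print(gamma_implied, gamma_ar1)
```

The implied and the directly estimated parameters are nearly identical, which is in line with the Wald test not rejecting the restriction.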

Example  5.25:  Food  Expenditure  (continued) 

In  Example  5.23  we  concluded  that  there  exists  significant  serial  correlation 
for  the  residuals  of  the  linear  food  expenditure  model  of  Example  5.20.  This 
is  an  indication  that  this  linear  model  is  not  correctly  specified.  As  it  makes 
no  sense  to  include  ‘lagged’  variables  for  cross  section  data,  we  consider 
instead  another  specification  of  the  functional  relation  between  income  and 
food expenditure. We will discuss (i) a non-linear model, (ii) the Breusch-Godfrey test for this non-linear model, and (iii) the outcome and interpretation of the test.

Panel 1: Dependent Variable: DAAA (1950.01 - 1999.12)

Variable       Coefficient   Std. Error   t-Statistic   Prob.
C                 0.006393     0.006982      0.915697   0.3602
DUS3MT            0.274585     0.014641     18.75442    0.0000

R-squared            0.370346
Durbin-Watson stat   1.446887

Panel 2: Dependent Variable: DAAA (1950.01 - 1999.12)

Variable       Coefficient   Std. Error   t-Statistic   Prob.
C                 0.004780     0.006712      0.712171   0.4766
DUS3MT            0.252145     0.015007     16.80237    0.0000
DUS3MT(-1)       -0.079636     0.017800     -4.473948   0.0000
DAAA(-1)          0.289881     0.040344      7.185151   0.0000

R-squared            0.420728
Durbin-Watson stat   1.897040

Panel 3: Wald Test
Null Hypothesis: C(2)*C(4) + C(3) = 0

F-statistic   0.215300   Probability   0.642814
Chi-square    0.215300   Probability   0.642645

Panel 4: Dependent Variable: DAAA (1950.01 - 1999.12)
Convergence achieved after 3 iterations

Variable       Coefficient   Std. Error   t-Statistic   Prob.
C                 0.006668     0.009423      0.707620   0.4795
DUS3MT            0.252361     0.014989     16.83586    0.0000
AR(1)             0.288629     0.040228      7.174887   0.0000

R-squared            0.420519
Durbin-Watson stat   1.896645

Panel 5: Correlograms

        OLS residuals                AR(1) residuals
Lag       AC    Q-Stat    Prob         AC    Q-Stat    Prob
 1     0.276    45.932   0.000      0.050    1.5232   0.217
 2    -0.076    49.398   0.000     -0.181    21.402   0.000
 3     0.008    49.441   0.000      0.013    21.509   0.000
 4     0.034    50.126   0.000      0.023    21.829   0.000
 5     0.055    51.939   0.000      0.036    22.622   0.000
 6     0.101    58.189   0.000      0.090    27.510   0.000
 7     0.035    58.934   0.000     -0.011    27.582   0.000
 8     0.049    60.412   0.000      0.030    28.136   0.000
 9     0.044    61.610   0.000      0.036    28.926   0.001
10     0.008    61.646   0.000     -0.005    28.938   0.001
11     0.032    62.289   0.000      0.058    31.023   0.001
12    -0.062    64.624   0.000     -0.044    32.189   0.001

Exhibit 5.33 Interest and Bond Rates (Example 5.24)

Regression models for AAA bond rates, simple regression model (Panel 1), dynamic model with single lags (Panel 2) with Wald test for AR(1) errors (Panel 3), simple regression model with AR(1) errors (Panel 4), and correlograms of residuals (Panel 5, for residuals of Panel 1 on the left and for residuals of Panel 4 on the right).




(i)  Non-linear  food  expenditure  model 

In  Example  4.3  (p.  205)  we  considered  a  non-linear  functional  form  for  the 
budget  data.  That  is, 


$$y_i = \beta_1 + \beta_2 x_{2i}^{\beta_3} + \beta_4 x_{3i} + \varepsilon_i,$$

where $y_i$ is the fraction of total expenditure spent on food, $x_{2i}$ is total expenditure (in units of 10,000 dollars per year), and $x_{3i}$ is the (average) household size of households in group $i$.

Panel 1 of Exhibit 5.34 shows the resulting estimates and a scatter diagram (in (b)) of the NLS residuals $e_i$ against their lagged values with correlation $r = 0.167$ (as compared to $r = 0.43$ for the residuals of the linear model in Example 5.20 (see Exhibit 5.29 (g))).


(ii)  Breusch-Godfrey  test  for  the  non-linear  model 

To test whether the residual correlation is significant we apply the Breusch-Godfrey LM-test for the non-linear model. Step 1 of this test consists of NLS, with NLS residuals $e_i$. To perform step 2 of this test, we first reformulate the non-linear model with AR(1) error terms as a non-linear regression model, similar to (5.48) for the linear model. The AR(1) model is $\varepsilon_i = \gamma\varepsilon_{i-1} + \eta_i$, where $\eta_i \sim NID(0, \sigma_\eta^2)$, and the non-linear model can be written in terms of the independent error terms $\eta_i$ as

$$y_i = \beta_1(1 - \gamma) + \gamma y_{i-1} + \beta_2 x_{2i}^{\beta_3} - \gamma\beta_2 x_{2,i-1}^{\beta_3} + \beta_4 x_{3i} - \gamma\beta_4 x_{3,i-1} + \eta_i.$$

This is a non-linear regression model $y_i = f(x_i, \theta) + \eta_i$ with $6 \times 1$ vector of regressors given by $x_i = (1, y_{i-1}, x_{2i}, x_{2,i-1}, x_{3i}, x_{3,i-1})'$ and with $5 \times 1$ parameter vector $\theta = (\beta_1, \beta_2, \beta_3, \beta_4, \gamma)'$. Now step 2 of the Breusch-Godfrey test can be performed as described in Section 4.2.4 (p. 217-8) for LM-tests; that is, the NLS residuals $e_i$ are regressed on the gradient $\partial f/\partial\theta$ evaluated at the restricted NLS estimates $\hat\theta = (b_1, b_2, b_3, b_4, 0)$, with $\gamma = 0$ and with the NLS estimates of the other parameters, as given in Panel 1 of Exhibit 5.34. The regressors in step 2 are therefore (for $\gamma = 0$) given by

$$\frac{\partial f}{\partial\beta_1} = 1 - \gamma = 1,$$

$$\frac{\partial f}{\partial\beta_2} = x_{2i}^{\beta_3} - \gamma x_{2,i-1}^{\beta_3} = x_{2i}^{b_3},$$

$$\frac{\partial f}{\partial\beta_3} = \beta_2 x_{2i}^{\beta_3}\log(x_{2i}) - \gamma\beta_2 x_{2,i-1}^{\beta_3}\log(x_{2,i-1}) = b_2 x_{2i}^{b_3}\log(x_{2i}),$$


Panel 1: Dependent Variable: FRACFOOD
Method: Least Squares
Sample: 1 48 (groups with size > 20)
Included observations: 48
Convergence achieved after 7 iterations
FRACFOOD=C(1)+C(2)*TOTCONS^C(3)+C(4)*AHSIZE

Parameter    Coefficient   Std. Error   t-Statistic   Prob.
C(1)            0.453923     0.054293      8.360611   0.0000
C(2)           -0.271015     0.053437     -5.071693   0.0000
C(3)            0.412584     0.115538      3.570982   0.0009
C(4)            0.016961     0.000991     17.11004    0.0000

R-squared            0.939246
Durbin-Watson stat   1.957808

(b) [Scatter diagram of the NLS residuals against their lagged values RESNONLINLAG (within segments); n = 42, r = 0.167.]

Panel 3: Dependent Variable: RESNONLIN
Method: Least Squares
Sample: 2-6, 8-15, 17-24, 26-32, 34-40, 42-48 (included obs 42)

Variable                             Coefficient   Std. Error   t-Statistic   Prob.
C                                       0.154181     0.070516      2.186478   0.0352
TOTCONS^(0.412584)                     -0.155027     0.070053     -2.212985   0.0331
TOTCONS^(0.412584)*LOG(TOTCONS)         0.078232     0.038573      2.028173   0.0498
AHSIZE                                  0.000423     0.001012      0.418210   0.6782
RESNONLIN(-1)                           0.195965     0.152414      1.285746   0.2065

R-squared   0.157982

Exhibit 5.34 Food Expenditure (Example 5.25)

Non-linear regression model for budget data (Panel 1), scatter plot of residuals against their lags (within segments (b)), and auxiliary regression of step 2 of Breusch-Godfrey test on serial correlation (Panel 3).

$$\frac{\partial f}{\partial\beta_4} = x_{3i} - \gamma x_{3,i-1} = x_{3i},$$

$$\frac{\partial f}{\partial\gamma} = y_{i-1} - \beta_1 - \beta_2 x_{2,i-1}^{\beta_3} - \beta_4 x_{3,i-1} = y_{i-1} - b_1 - b_2 x_{2,i-1}^{b_3} - b_4 x_{3,i-1} = e_{i-1}.$$

Therefore the required regression in step 2 is

$$e_i = \delta_1 + \delta_2 x_{2i}^{b_3} + \delta_3 x_{2i}^{b_3}\log(x_{2i}) + \delta_4 x_{3i} + \gamma e_{i-1} + \omega_i.$$

The Breusch-Godfrey test is $LM = nR^2$ of this regression.

(iii)  Outcome  and  interpretation  of  the  test 

The results in Panel 3 of Exhibit 5.34 show that $LM = nR^2 = 42 \cdot 0.158 = 6.64$ with P-value $P = 0.010$ (there are forty-two relevant observations because residuals in different segments should not be compared to each other (see Example 5.23)). This indicates that there still exists significant serial correlation, although the coefficient of $e_{i-1}$ is not significant ($P = 0.207$ (see Panel 3 of Exhibit 5.34), as compared to $P = 0.007$ for the linear model in Panel 4 of Exhibit 5.32). So the above simple non-linear model does not capture all the non-linear effects of the variable $x_2$ on $y$, but it is an improvement as compared to the linear model.
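The reported P-values can be reproduced from the $\chi^2(1)$ distribution. The sketch below uses the identity $P(\chi^2(1) > x) = P(|Z| > \sqrt{x}) = \mathrm{erfc}(\sqrt{x/2})$ for a standard normal $Z$; the function name is hypothetical, and the $R^2$ values are those reported in Exhibits 5.34 (Panel 3) and 5.32 (Panel 4).

```python
import math

def chi2_1_pvalue(lm):
    """Upper tail probability of the chi-square distribution with one degree of freedom."""
    return math.erfc(math.sqrt(lm / 2.0))

n = 42                                   # residual pairs within segments
lm_nonlinear = n * 0.157982              # non-linear model, Exhibit 5.34, Panel 3
lm_linear = n * 0.183146                 # linear model, Exhibit 5.32, Panel 4

p_nonlinear = chi2_1_pvalue(lm_nonlinear)
p_linear = chi2_1_pvalue(lm_linear)
print(round(lm_nonlinear, 2), round(p_nonlinear, 3))
print(round(lm_linear, 2), round(p_linear, 3))
```

This shortcut holds only for one degree of freedom; for a test with $p > 1$ lags a general $\chi^2(p)$ tail function would be needed.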

This example shows that serial correlation tests can be applied as diagnostic tools for cross section data, provided that the observations are ordered in a meaningful way.




Example  5.26:  Industrial  Production 

Whereas  the  two  foregoing  examples  were  concerned  with  data  from  finance 
and  microeconomics,  serial  correlation  is  also  often  a  relevant  issue  for 
macroeconomic time series. For instance, serial correlation may result because
of prolonged up- and downswings of macroeconomic variables from
their  long-term  growth  path.  Although  the  discussion  of  time  series  models  is 
postponed  till  Chapter  7,  we  will  now  give  a  brief  illustration.  We  will  discuss 
(i)  the  data,  (ii)  a  simple  trend  model,  (iii)  tests  on  serial  correlation,  and  (iv) 
interpretation  of  the  result. 

(i)  The  data 

We  consider  quarterly  data  on  industrial  production  in  the  USA  over  the 
period  1950.1  until  1998.3.  The  data  are  taken  from  the  OECD  main 
economic  indicators. 

(ii)  A  simple  trend  model 

We  denote  the  series  of  industrial  production  by  INP.  In  order  to  model  the 
exponential  growth  of  this  series,  we  fit  a  linear  trend  to  the  logarithm  of  this 
series.  We  estimate  the  simple  regression  model 


$$\log(INP_i) = \alpha + \beta i + \varepsilon_i,$$

where $x_i = i$ denotes the linear trend. The result is shown in Panel 1 of Exhibit 5.35. The estimated quarterly growth rate is around 0.8 per cent, corresponding to a yearly growth rate of around 3.3 per cent.
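The growth rates follow from the trend coefficient in Panel 1 of Exhibit 5.35. As a small worked check (variable names are hypothetical), a slope $\beta$ in a log-linear trend model corresponds to a growth rate of $e^{\beta} - 1$ per period, or approximately $\beta$ for small values:

```python
import math

beta = 0.008197                                # trend coefficient, Exhibit 5.35, Panel 1

quarterly_growth = math.exp(beta) - 1.0        # exact growth per quarter
yearly_growth_linear = 4 * beta                # simple approximation: four quarters
yearly_growth_compound = math.exp(4 * beta) - 1.0

print(f"quarterly: {100 * quarterly_growth:.2f} per cent")
print(f"yearly (approximate): {100 * yearly_growth_linear:.2f} per cent")
print(f"yearly (compounded): {100 * yearly_growth_compound:.2f} per cent")
```

Both yearly figures round to about 3.3 per cent, in line with the value quoted in the text.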

(iii)  Tests  on  serial  correlation 

The Durbin-Watson statistic is very close to zero, indicating a strong positive
serial  correlation  in  the  residuals.  This  is  also  clear  from  the  autocorrelations 
of  the  residuals  in  Panel  2  of  Exhibit  5.35.  Both  the  Ljung-Box  test  in  Panel  2 


Panel 1: Dependent Variable: LOG(IP)
Method: Least Squares
Sample(adjusted): 1950:1 1998:3
Included observations: 195 after adjusting endpoints

Variable           Coefficient   Std. Error   t-Statistic   Prob.
C                     3.321756     0.012141     273.6052    0.0000
@TREND(1950.1)        0.008197     0.000108      75.71580   0.0000

R-squared            0.967431
S.E. of regression   0.085094
Durbin-Watson stat   0.084571

Panel 2: Correlogram of residuals

Lag       AC    Q-Stat    Prob
 1     0.941    175.36   0.000
 2     0.875    327.80   0.000
 3     0.813    459.89   0.000
 4     0.768    578.37   0.000
 5     0.716    681.88   0.000
 6     0.686    777.61   0.000
 7     0.649    863.63   0.000
 8     0.632    945.73   0.000
 9     0.609    1022.4   0.000
10     0.589    1094.5   0.000
11     0.556    1159.2   0.000
12     0.548    1222.2   0.000

Panel 3: Breusch-Godfrey Serial Correlation LM Test:

F-statistic      747.5207   Probability   0.000000
Obs*R-squared    172.9098   Probability   0.000000

Test Equation:
Dependent Variable: RESID
Method: Least Squares

Variable           Coefficient   Std. Error   t-Statistic   Prob.
C                     0.000138     0.004108      0.033538   0.9733
@TREND(1950.1)       -2.13E-06     3.66E-05     -0.058197   0.9537
RESID(-1)             1.026273     0.072090     14.23600    0.0000
RESID(-2)            -0.090350     0.072114     -1.252877   0.2118

R-squared   0.886717

Exhibit 5.35 Industrial Production (Example 5.26)

Linear trend model for industrial production (in logarithms) (Panel 1, @TREND denotes the linear trend), correlogram of residuals (Panel 2), and Breusch-Godfrey test (Panel 3).

[Figure: time plot with series Residual, Actual, and Fitted.]

Exhibit 5.36 Industrial Production (Example 5.26)

Actual and fitted values of US quarterly industrial production (in logarithms, right vertical axis) and plot of residuals with 95% confidence interval (left vertical axis).

and  the  Breusch-Godfrey  test  in  Panel  3  (with  p  =  2)  strongly  reject  the 
absence  of  serial  correlation. 
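The Q-statistics in Panel 2 can be approximated from the autocorrelations. The sketch below assumes the standard Ljung-Box formula Q(m) = n(n+2) * sum_{k=1..m} r_k^2 / (n-k); the function name `ljung_box` and the toy series are illustrative only and are not taken from the text.

```python
import numpy as np

def ljung_box(e, max_lag):
    """Return sample autocorrelations r_1..r_m and cumulative Ljung-Box Q-statistics."""
    e = np.asarray(e, dtype=float)
    e = e - e.mean()                      # residuals have mean zero if the model has a constant
    n = e.size
    denom = e @ e
    lags = np.arange(1, max_lag + 1)
    r = np.array([(e[k:] @ e[:-k]) / denom for k in lags])
    q = n * (n + 2) * np.cumsum(r**2 / (n - lags))
    return r, q

# Toy alternating series (hypothetical data, strongly negatively autocorrelated).
r, q = ljung_box([1.0, -1.0, 1.0, -1.0, 1.0, -1.0], max_lag=2)
print(r, q)

# First-lag check with the rounded values n = 195 and r_1 = 0.941 from Panel 2.
q1_exhibit = 195 * 197 * 0.941**2 / 194
print(q1_exhibit)
```

With the rounded autocorrelation 0.941 the first-lag value comes out at about 175.3, close to the reported 175.36; the small gap is consistent with rounding of the autocorrelation.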

(iv)  Interpretation 

The  time  plot  of  the  residuals  in  Exhibit  5.36  shows  that  the  growth  was 
above  average  for  a  long  period  from  around  1965  to  1980.  Such  prolonged 
deviations  from  the  linear  trend  line  indicate  that  this  simple  linear  trend 
model  misses  important  dynamical  aspects  of  the  time  series.  More  realistic 
models  for  this  series  will  be  presented  in  Chapter  7. 

Exercises:  T:  5.10;  E:  5.27c,  d,  5.29a,  b. 


5.5.5  Summary 

If  the  error  terms  in  a  regression  model  are  serially  correlated,  this  means 
that  the  model  misses  some  of  the  systematic  factors  that  influence  the 
dependent  variable.  One  should  then  try  to  find  the  possible  causes  and  to 
adjust  the  model  accordingly.  The  following  steps  may  be  helpful  in  the 
diagnostic  analysis. 

•  Order  the  observations  in  a  natural  way.  The  ordering  is  evident  for  time 
series  data,  and  serial  correlation  is  one  of  the  major  issues  for  such  data 
(see  Chapter  7).  In  the  case  of  cross  section  data,  the  analysis  of  serial 
correlation  makes  sense  only  if  the  observations  are  ordered  in  some 
meaningful  way. 

• Check whether serial correlation is present, by drawing the correlogram
of the residuals and by performing tests, in particular the
Breusch-Godfrey LM-test.

• If serial correlation is present, OLS is no longer efficient and the usual
formulas for the standard errors do not apply. If it is not possible to
adjust the model to remove the serial correlation, then OLS can be
applied with Newey-West standard errors.

•  The  best  way  to  deal  with  serial  correlation  is  to  adjust  the  model  so  that 
the  correlation  disappears.  This  may  sometimes  be  achieved  by 
adjusting  the  specification  of  the  functional  relation  —  for  instance, 
by  including  lagged  variables  in  the  model  (this  is  further  discussed  in 
Chapter  7). 
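The Breusch-Godfrey check in the second step can be sketched as an auxiliary regression of the OLS residuals on the regressors and p lagged residuals, with LM = n * R^2. The sketch assumes that missing presample residuals are set to zero (this matches Obs*R-squared = 195 x 0.886717 = 172.91 in Exhibit 5.35, but it is an assumption about the software convention); the function name and the simulated data are hypothetical.

```python
import numpy as np

def breusch_godfrey(X, y, p):
    """LM statistic n*R^2 of the auxiliary regression; asymptotically chi-squared(p)."""
    n = len(y)
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    # Lagged residuals, with presample values set to zero (assumed convention).
    lags = np.column_stack(
        [np.concatenate([np.zeros(i), e[:-i]]) for i in range(1, p + 1)]
    )
    Z = np.column_stack([X, lags])
    g, *_ = np.linalg.lstsq(Z, e, rcond=None)
    u = e - Z @ g
    r2 = 1.0 - (u @ u) / (e @ e)   # e has mean zero when X contains a constant
    return n * r2

# Simulated linear trend model with AR(1) disturbances (illustrative values only).
rng = np.random.default_rng(0)
n = 195
shocks = rng.normal(size=n)
eps = np.zeros(n)
for t in range(1, n):
    eps[t] = 0.9 * eps[t - 1] + shocks[t]
trend = np.arange(n)
X = np.column_stack([np.ones(n), trend])
y = 3.0 + 0.01 * trend + eps

lm = breusch_godfrey(X, y, p=2)
print(lm)   # compare with the 5% critical value of chi-squared(2), about 5.99
```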




5.6  Disturbance  distribution 


5.6.1  Introduction 

Weighted  influence  of  individual  observations 

In ordinary least squares, the regression parameters are estimated by
minimizing the criterion

$$ S(\beta) = \sum_{i=1}^{n} (y_i - x_i'\beta)^2. $$

This  means  that  errors  are  penalized  in  the  same  way  for  all  observations  and 
that  large  errors  are  penalized  more  than  proportionally.  An  alternative  is  to 
apply  weighted  least  squares  where  the  errors  are  not  all  penalized  in  the 
same  way.  For  instance,  for  time  series  data  the  criterion 

$$ S_w(\beta) = \sum_{t=1}^{n} w^{n-t} (y_t - x_t'\beta)^2 \qquad (5.53) $$

(with $0 < w < 1$) assigns larger weights to more recent observations. This
criterion may be useful, for instance, when the parameters $\beta$ vary over time
so  that  the  most  recent  observations  contain  more  information  on  the  current 
parameter  values  than  the  older  observations.  In  Section  5.4  the  use  of 
weighted  least  squares  was  motivated  by  heteroskedastic  error  terms.  For 
time-varying  parameters  the  criterion  (5.53)  allows  for  relatively  larger 
residuals  for  older  observations. 
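A minimal sketch of criterion (5.53) follows. It minimizes the discounted sum of squares by rescaling each observation with the square root of its weight and then applying ordinary least squares. The function name `discounted_wls`, the simulated data, and the choice w = 0.9 are illustrative assumptions.

```python
import numpy as np

def discounted_wls(X, y, w):
    """Minimize sum_t w**(n-t) * (y_t - x_t' beta)**2, with t = 1, ..., n."""
    n = len(y)
    weights = w ** (n - np.arange(1, n + 1))   # weight 1 for t = n, w**(n-1) for t = 1
    root = np.sqrt(weights)
    beta, *_ = np.linalg.lstsq(X * root[:, None], y * root, rcond=None)
    return beta

# Hypothetical data: the slope shifts from 1 to 2 halfway through the sample.
rng = np.random.default_rng(0)
n = 200
t = np.arange(1, n + 1)
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
slope = np.where(t <= 100, 1.0, 2.0)
y = 0.5 + slope * x + 0.1 * rng.normal(size=n)

b_ols = discounted_wls(X, y, 1.0)   # w = 1 gives equal weights, that is, OLS
b_dis = discounted_wls(X, y, 0.9)   # recent observations dominate
print(b_ols, b_dis)
```

In this setup the discounted estimate of the slope should lie much closer to the current value 2 than the OLS estimate, which averages over both regimes.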

Overview 

In  Section  5.6.2  we  investigate  the  question  of  which  observations  are  the 
most  influential  ones  in  determining  the  outcomes  of  an  ordinary  least 
squares regression. If the outcomes depend heavily on only a few
observations, it is advisable to investigate the validity of these data. If the
explanation of outlying data falls within the purpose of the analysis, then the
specification  of  the  model  should  be  reconsidered. 

Section  5.6.3  contains  a  test  for  normality  of  the  disturbances.  It  may  be 
that outlying observations are caused by special circumstances that fall
outside the scope of the model. The influence of such data can be reduced by
using a less sensitive criterion function, for example,

$$ S_{abs}(\beta) = \sum_{i=1}^{n} |y_i - x_i'\beta|. $$

If  the  outcomes  of  estimation  and  testing  methods  are  less  sensitive  to 
individual  observations  and  to  the  underlying  model  assumptions,  then 
such  methods  are  called  robust.  Robust  methods  are  discussed  in  Section 
5.6.4. 


5.6.2  Regression  diagnostics 

The  leverage  of  an  observation 

To characterize influential data in the regression model $y = X\beta + \varepsilon$, we use
the hat-matrix $H$ defined by (see Section 3.1.3 (p. 123))

$$ H = X(X'X)^{-1}X'. $$

The explained part $\hat{y}$ of the dependent variable $y$ is given by $\hat{y} = Xb = Hy$.
The $j$th diagonal element of $H$ is denoted by

$$ h_j = x_j'(X'X)^{-1}x_j, \qquad (5.54) $$

where the $1 \times k$ vector $x_j'$ is the $j$th row of the $n \times k$ matrix $X$. The value $h_j$ is
called the leverage of the $j$th observation. The leverages satisfy $0 \le h_j \le 1$
and $\sum_{j=1}^{n} h_j = k$ (see Exercise 5.12). So the mean leverage is equal to $k/n$. A
large leverage $h_j$ means that the values of the explanatory variables $x_j$ are
somewhat unusual as compared to the average of these values over the
sample.
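The properties of the leverages can be checked numerically. The sketch below uses simulated regressors (hypothetical data) in which the first observation is given an unusual regressor value, so that it should receive the largest leverage.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
X[0, 1] = 8.0                              # unusual value of an explanatory variable

H = X @ np.linalg.solve(X.T @ X, X.T)      # hat matrix H = X (X'X)^{-1} X'
h = np.diag(H)                             # leverages h_j as in (5.54)

print(h.sum(), k)                          # the leverages sum to k
print(h.mean(), k / n)                     # so the mean leverage is k/n
print(h.argmax(), h.max())                 # observation with the largest leverage
```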

Characterization  of  outliers 

An observation is called an outlier if the value of the dependent variable $y_j$
differs substantially from what would be expected from the general pattern
of the other observations. To test whether the $j$th observation is an outlier,
we consider the model with a dummy variable for the $j$th observation,
that is,

$$ y_i = x_i'\beta + \gamma D_{ji} + \varepsilon_i, \qquad i = 1, \ldots, n, \qquad (5.55) $$

with $D_{jj} = 1$ and $D_{ji} = 0$ for $i \neq j$. The null hypothesis that the $j$th
observation fits in the general pattern of the data corresponds to the null
hypothesis that $\gamma = 0$, and this can be tested by the t-test.


Derivation  of  studentized  residuals 

Let $D_j$ denote the $n \times 1$ vector with elements $D_{ji}$, $i = 1, \ldots, n$; then the model in
(5.55) can be written as $y = X\beta + D_j\gamma + \varepsilon$. According to the result of
Frisch-Waugh in Section 3.2.5 (p. 146), the OLS estimator of $\gamma$ is given by

$$ \hat{\gamma} = (D_j'MD_j)^{-1}D_j'My = (D_j'D_j - D_j'X(X'X)^{-1}X'D_j)^{-1}D_j'e
   = (1 - x_j'(X'X)^{-1}x_j)^{-1}e_j = \frac{e_j}{1 - h_j}. $$

Here $M = I - H$ and $e = My$ is the usual vector of OLS residuals in
the model $y = X\beta + \varepsilon$, that is, in (5.55) with $\gamma = 0$. If $\varepsilon \sim N(0, \sigma^2 I)$,
then $e = M\varepsilon \sim N(0, \sigma^2 M)$ and $e_j = D_j'e \sim N(0, \sigma^2(1 - h_j))$, so that
$\hat{\gamma} \sim N(0, \sigma^2/(1 - h_j))$. Let $s_j^2$ be the OLS estimator of $\sigma^2$ based on the model
(5.55), including the dummy. Then the t-value of $\hat{\gamma}$ in (5.55) is given by

$$ e_j^* = \frac{\hat{\gamma}}{s_j/\sqrt{1 - h_j}} = \frac{e_j}{s_j\sqrt{1 - h_j}}. \qquad (5.56) $$

This statistic follows the $t(n - k - 1)$ distribution under the null hypothesis that
$\gamma = 0$. The $j$th observation is an outlier if $\hat{\gamma}$ is significant, that is, if the residual $e_j$
or the leverage $h_j$ is sufficiently large. The residuals $e_j^*$ are called the studentized
residuals. Note that the dummy variable is included only to compute the
studentized residual. This should not be interpreted as advice to include dummies in
the model for each outlier. Indeed, if one uses the rule of thumb $|t| > 2$ for
significance, then one may expect that 5 per cent of all observations are ‘outliers’.
Such ‘ordinary’ outliers are of no concern, but one should pay attention to large
outliers (with t-values further away from zero) and try to understand the cause of
such outliers, as this may help to improve the model.
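The equivalence between the t-value of the dummy in (5.55) and formula (5.56) can be verified numerically. The sketch below uses simulated data with one contaminated observation; the helper `ols` and all numerical settings are illustrative assumptions.

```python
import numpy as np

def ols(X, y):
    """Return coefficients, residuals, s^2, and the estimated covariance matrix."""
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    e = y - X @ b
    s2 = e @ e / (len(y) - X.shape[1])
    return b, e, s2, s2 * XtX_inv

rng = np.random.default_rng(2)
n = 25
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(size=n)
j = 4                                      # 0-based index of the suspect observation
y[j] += 6.0                                # contaminate this observation

b, e, s2, _ = ols(X, y)
h = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)   # leverages h_j

# Regression (5.55) with a dummy for observation j.
D = np.zeros(n)
D[j] = 1.0
bd, ed, s2_j, cov_d = ols(np.column_stack([X, D]), y)
gamma_hat = bd[-1]
t_dummy = gamma_hat / np.sqrt(cov_d[-1, -1])

# Studentized residual from (5.56), using s_j from the dummy regression.
e_star = e[j] / (np.sqrt(s2_j) * np.sqrt(1.0 - h[j]))
print(t_dummy, e_star)
```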


The  ‘leave-one-out’  interpretation  of  studentized  residuals 

The $j$th studentized residual can also be obtained by leaving out the $j$th
observation. That is, perform a regression in the model $y_i = x_i'\beta + \varepsilon_i$ using
the $(n - 1)$ observations with $i \neq j$, so that the $j$th observation is excluded.
Let $b(j)$ and $s^2(j)$ be the corresponding OLS estimators of $\beta$ and $\sigma^2$. It is left as
an exercise (see Exercise 5.12) to show that $b(j)$ is the OLS estimator of $\beta$ in
(5.55) (for $n$ observations), that $s^2(j) = s_j^2$ (that is, the OLS estimator of $\sigma^2$ in
(5.55)), and that $y_j - x_j'b(j) = \hat{\gamma}$ in (5.55). With these results it follows from
(5.56) that the studentized residuals can be computed as

$$ e_j^* = \frac{y_j - x_j'b(j)}{s(j)/\sqrt{1 - h_j}}. $$

The studentized residual can be interpreted in terms of the Chow forecast test
of observation $j$, where the forecast is based on the model estimated from the
$(n - 1)$ observations $i \neq j$ (see Exercise 5.12). If $e_j^*$ is large, this means that $y_j$
cannot be predicted well from the other observations, so that the $j$th
observation does not fit the general pattern of the other observations. In this sense
the $j$th observation then is an outlier.
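The leave-one-out route can be sketched in the same way: delete observation j, estimate the model on the remaining n - 1 observations, and scale the prediction error for y_j. The data and the index j below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, -0.5]) + rng.normal(size=n)
k = X.shape[1]
j = 7                                          # 0-based index of the deleted observation

XtX_inv = np.linalg.inv(X.T @ X)
h_j = X[j] @ XtX_inv @ X[j]                    # leverage from the full sample
e = y - X @ (XtX_inv @ X.T @ y)                # full-sample OLS residuals

keep = np.arange(n) != j
Xm, ym = X[keep], y[keep]
b_loo = np.linalg.solve(Xm.T @ Xm, Xm.T @ ym)  # b(j)
res = ym - Xm @ b_loo
s_loo = np.sqrt(res @ res / (n - 1 - k))       # s(j)

pred_err = y[j] - X[j] @ b_loo                 # y_j - x_j' b(j)
e_star = pred_err / (s_loo / np.sqrt(1.0 - h_j))
print(pred_err, e_star)
```

The prediction error equals e_j / (1 - h_j), so the result coincides with (5.56) when s_j is replaced by s(j).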

OLS  may  not  detect  outliers 

It should be noted that outliers may not always be detected from the plot of
OLS residuals $e_j$. This is because OLS tries to prevent very large residuals, so
that $e_j$ may be small even if $e_j^*$ is large. Because of (5.56), this can occur if the
leverage of the observation is large. This is illustrated by a simulation in
Exhibit 5.37. The outlier (corresponding to the first observation, with $j = 1$)
is not detected from the residuals if we include all observations (a-b), but
it is revealed very clearly if the outlier observation is excluded from the
regression (c-e).

Panels 7 and 8 of Exhibit 5.37 illustrate that the estimates of $\beta$ and $\sigma^2$
and the sum of squared residuals (SSR) in (5.55) are the same as the
estimates obtained by deleting the outlier observation. The $R^2$ of (5.55) is
much larger, however. This is simply caused by the fact that the total sum
of squares is much larger for the set of all observations (SST = 1410 in Panel
8) than for the set of observations excluding the outlier (SST = 285 in
Panel 7).
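As a worked check, the two values of R-squared follow from the sums of squares reported in Panels 7 and 8 of Exhibit 5.37, using the usual definition of R-squared as one minus SSR over SST:

```latex
\begin{align*}
R^2 &= 1 - \frac{SSR}{SST}, \\
\text{Panel 8 (with dummy):} \quad R^2 &= 1 - \frac{223.4754}{1409.682} \approx 0.8415, \\
\text{Panel 7 (outlier deleted):} \quad R^2 &= 1 - \frac{223.4754}{284.6977} \approx 0.2150.
\end{align*}
```

The numerator is identical in both cases; only the total sum of squares differs.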


Influence  on  parameter  estimates:  ‘dfbetas’ 

The influence of individual observations on the estimates of $\beta$ can be evaluated as
follows. Let $b$ be the usual OLS estimator in (5.55) under the restriction that
$\gamma = 0$, with residuals $e$, and let $b(j)$ and $\hat{\gamma}$ be the OLS estimators in (5.55) with
the dummy included, with residuals $e(j)$. Then $y = Xb + e = Xb(j) + D_j\hat{\gamma} + e(j)$,
so that

$$ X(b - b(j)) - D_j\hat{\gamma} - e(j) + e = 0. $$

If we premultiply this with $X'$ and use that $X'e = 0$, $X'e(j) = 0$, and $X'D_j = x_j$,
then we obtain

$$ b - b(j) = (X'X)^{-1}X'D_j\hat{\gamma} = \frac{1}{1 - h_j}(X'X)^{-1}x_j e_j. \qquad (5.57) $$
Panel 6: Dependent Variable: Y
Included observations: 25

Variable   Coefficient   Std. Error   t-Statistic   Prob.
C          43.99717      4.873317     9.028179      0.0000
X          -2.988136     0.448928     -6.656161     0.0000

R-squared             0.658269    Sum squared resid       481.7315
S.E. of regression    4.576554    Total sum of squares    1409.682

Panel 7: Dependent Variable: Y
Included observations: 24 (first observation, outlier, removed)

Variable   Coefficient   Std. Error   t-Statistic   Prob.
C          -22.73838     13.66355     -1.664163     0.1103
X          3.028141      1.233459     2.454999      0.0225

R-squared             0.215043    Sum squared resid       223.4754
S.E. of regression    3.187157    Total sum of squares    284.6977

Panel 8: Dependent Variable: Y
Included observations: 25 (DUM1 is dummy for first observation)

Variable   Coefficient   Std. Error   t-Statistic   Prob.
C          -22.73838     13.66355     -1.664163     0.1103
X          3.028141      1.233459     2.454999      0.0225
DUM1       64.71023      12.83368     5.042220      0.0000

R-squared             0.841471    Sum squared resid       223.4754
S.E. of regression    3.187157    Total sum of squares    1409.682

Exhibit 5.37 Outliers and OLS

Scatter diagrams and residuals ((a)-(d)), studentized residuals (e), regressions with outlier
(Panel 6), without outlier (Panel 7) and with outlier dummy (Panel 8).

It is preferable to make the difference $b_l - b_l(j)$ in the $l$th estimated parameter,
owing to the $j$th observation, invariant with respect to the measurement scale of
the explanatory variable $x_l$. Therefore this difference is scaled with an estimate of
the standard deviation of $b_l$, for example, $s_j\sqrt{a_{ll}}$, where $a_{ll}$ is the $l$th diagonal
element of $(X'X)^{-1}$. This gives the dfbetas defined by

$$ \text{dfbetas}_{lj} = \frac{b_l - b_l(j)}{s_j\sqrt{a_{ll}}}. \qquad (5.58) $$

It is left as an exercise (see Exercise 5.13) to show that (under appropriate
conditions) the variance of the ‘dfbetas’ is approximately $1/n$. So the difference
in the parameter estimates can be stated to be significant if the value of (5.58) is (in
absolute value) larger than $2/\sqrt{n}$.
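The dfbetas can be computed for all observations at once from (5.57) and (5.58), without re-estimating the model n times. The sketch below uses simulated data; it also relies on the standard deletion identity SSR(j) = e'e - e_j^2 / (1 - h_j) to obtain s_j, which is not derived in the text and is used here as an assumption.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([0.5, 1.5]) + rng.normal(size=n)
k = X.shape[1]

A = np.linalg.inv(X.T @ X)                   # (X'X)^{-1}, with diagonal elements a_ll
b = A @ X.T @ y
e = y - X @ b
h = np.einsum("ij,jk,ik->i", X, A, X)        # leverages h_j

# s_j^2 for every j via the deletion identity (assumed, see lead-in).
s2_j = (e @ e - e**2 / (1.0 - h)) / (n - k - 1)

# Row j holds b - b(j) from (5.57): (X'X)^{-1} x_j e_j / (1 - h_j).
delta = (X @ A) * (e / (1.0 - h))[:, None]

# dfbetas as in (5.58): row j, column l.
dfbetas = delta / (np.sqrt(s2_j)[:, None] * np.sqrt(np.diag(A))[None, :])

flagged = np.abs(dfbetas) > 2.0 / np.sqrt(n)
print(np.argwhere(flagged))
```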


Influence  on  fitted  values:  ‘dffits’ 

The influence of the $j$th observation on the fitted values is given by $\hat{y} - \hat{y}(j)$, where
$\hat{y} = Xb$ and $\hat{y}(j) = Xb(j)$. In particular, by using (5.57) the difference in the fitted
values for $y_j$ is given by

$$ \hat{y}_j - \hat{y}_j(j) = x_j'(b - b(j)) = \frac{h_j}{1 - h_j}e_j. $$

As $e_j = y_j - \hat{y}_j$ it follows that

$$ \hat{y}_j = h_j y_j + (1 - h_j)\hat{y}_j(j). $$

Therefore, the leverage $h_j$ gives the relative weight of the observation $y_j$ itself
in constructing the predicted value for the $j$th observation. That is, if $h_j$ is large,
then the $j$th observation may be difficult to fit from the other observations.
Because the variance of $\hat{y}_j$ is equal to $E[(x_j'b - x_j'\beta)^2] = \sigma^2 x_j'(X'X)^{-1}x_j = \sigma^2 h_j$, a
scale invariant measure for the difference in fitted values is given by the dffits
defined by

$$ \text{dffits}_j = \frac{\hat{y}_j - \hat{y}_j(j)}{s_j\sqrt{h_j}} = \frac{e_j\sqrt{h_j}}{s_j(1 - h_j)} = e_j^*\sqrt{\frac{h_j}{1 - h_j}}. $$

Also in this respect, the $j$th observation is influential if the studentized residual
or the leverage is large. As $\text{var}(e_j^*) \approx 1$ and $h_j$ is generally very small for large
enough sample sizes, it follows that ‘dffits’ has a variance of approximately
$\text{var}(\text{‘dffits’}) \approx \text{var}(e_j^*\sqrt{h_j}) = h_j$. As $\sum_{j=1}^{n} h_j = k$, the average variance is
approximately $k/n$. Therefore, differences in fitted values can be stated to be significant if
‘dffits’ is larger (in absolute value) than $2\sqrt{k/n}$.
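The dffits and the weight interpretation of the leverage can be sketched as follows. The data are simulated, and s_j is again obtained from the deletion identity SSR(j) = e'e - e_j^2 / (1 - h_j), which is an assumption not derived in the text.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 1.0]) + rng.normal(size=n)
k = X.shape[1]

A = np.linalg.inv(X.T @ X)
h = np.einsum("ij,jk,ik->i", X, A, X)               # leverages h_j
e = y - X @ (A @ X.T @ y)                           # OLS residuals
s_j = np.sqrt((e @ e - e**2 / (1.0 - h)) / (n - k - 1))
e_star = e / (s_j * np.sqrt(1.0 - h))               # studentized residuals (5.56)

dffits = e_star * np.sqrt(h / (1.0 - h))
threshold = 2.0 * np.sqrt(k / n)
print(np.flatnonzero(np.abs(dffits) > threshold))

# Fitted value as a weighted average of y_j and the leave-one-out prediction.
yhat = y - e
yhat_loo = yhat - h / (1.0 - h) * e                 # yhat_j(j) for every j
```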

What to do with influential observations?

If  the  most  influential  observations  in  the  data  set  are  detected,  the  question 
arises  what  to  do  with  these  observations.  They  can  be  the  most  important 
pieces  of  information,  in  which  case  their  large  influence  is  justified.  If  the 
influential  observations  do  not  fit  well  in  the  general  pattern  of  the  data,  one 
may  be  tempted  to  delete  them  from  the  analysis.  However,  one  should  be 
careful  not  to  remove  important  sample  information.  In  any  case  one  should 
check  whether  these  observations  are  correctly  reported  and  one  should 
investigate  whether  outliers  can  possibly  be  explained  in  terms  of  additional 
explanatory  variables. 


Example  5.27:  Stock  Market  Returns  (continued) 

As  an  illustration  we  consider  the  stock  market  returns  data  that  were 
introduced  in  Example  2.1  (p.  76-7).  We  will  discuss  (i)  the  data  and  the 
possibility  of  outliers  and  (ii)  the  analysis  of  influential  data. 

(i)  The  data  and  the  possibility  of  outliers 

Financial markets are characterized by sudden deviations from normal
operation caused by crashes or moments of excessive growth. Here we consider
data  on  excess  returns  in  the  sector  of  cyclical  consumer  goods  and  in  the 
whole  market  in  the  UK.  These  data  were  previously  analysed  in  Examples 
2.1,  4.4  (p.  223-4),  and  4.5  (p.  243-6),  and  in  Section  4.4.6  (p.  262-5).  In 
Examples  4.4  and  4.5  we  analysed  the  possibility  of  fat  tails  and  now  we  will 
apply  regression  diagnostics  on  these  data. 

(ii)  Analysis  of  influential  data 

The  data  consist  of  monthly  observations  over  the  period  from  January  1980 
to December 1999. The CAPM corresponds to the simple regression model
$y_i = \alpha + \beta x_i + \varepsilon_i$, where $y_i$ is the excess return of the sector and $x_i$ that of the
market. Exhibit 5.38 provides graphical information (leverages, studentized
residuals,  dfbetas,  and  dffits,  among  others)  and  Exhibit  5.39  displays  the 
characteristics  for  some  of  the  data  points.  The  observation  in  October  1987 
(when  a  crash  took  place)  has  a  very  large  leverage  but  a  small  studentized 
residual,  so  that  this  is  not  an  outlier.  Such  observations  are  helpful  in 
estimation,  as  they  fit  well  in  the  estimated  model  and  reduce  the  standard 
errors  of  the  estimated  parameters.  On  the  other  hand,  the  observations  in 
September  1980  and  September  1982  are  outliers,  but  the  leverages  of  these 
observations  are  small. 

Exhibit 5.38 Stock Market Returns (Example 5.27)

Time plots of excess returns in market (Panel 1) and in sector of cyclical consumer goods (Panel 2),
scatter diagram of excess returns in sector against market (Panel 3), regression of excess sector
returns on excess market returns with corresponding residuals ($e$, Panel 4), leverages (Panel 5),
standard deviations ($s_j$, Panel 6), studentized residuals (Panel 7), dfbetas (Panel 8), and dffits
(Panel 9). The dashed horizontal lines in Panels 4 and 7-9 denote 95% confidence intervals.

Characteristic    Residual e_j   Leverage h_j   St. Resid. e_j*   dfbetas         dffits
5% crit. value    ±2s =          2/n =          ±2                ±2/sqrt(n) =    ±2 sqrt(2/n) =
                  ±11.086        ±0.008                           ±0.129          ±0.183

1980:06           -10.694        0.015*         -1.956            -0.209*         -0.245*
1980:09           -20.412*       0.005          -3.795*           -0.139*         -0.282*
1981:04           15.115*        0.010*         2.779*            0.214*          0.280*
1981:09           2.551          0.059*         0.474             -0.115          0.119
1982:09           -19.558*       0.005          -3.626*           -0.086          -0.250*
1983:04           -13.588*       0.011*         -2.492*           -0.204*         -0.260*
1987:10           -2.480         0.156*         -0.486            0.206*          -0.209*
1991:02           8.930          0.021*         1.634             0.215*          0.240*

Exhibit  5.39  Stock  Market  Returns  (Example  5.27) 

Characteristics  of  some  selected  influential  observations  in  CAPM  over  the  period  1980.01- 
1999.12  (n  =  240  observations).  An  *  indicates  values  that  differ  significantly  from  zero  (at 
5%  significance). 
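The critical values in Exhibit 5.39 follow from n = 240 and k = 2, and the dffits column can be cross-checked against the studentized residual and the leverage. The short calculation below uses only rounded values from the exhibit, so small rounding differences are to be expected.

```python
import math

n, k = 240, 2
mean_leverage = k / n                    # reference value for the leverages
dfbetas_crit = 2 / math.sqrt(n)          # 2 / sqrt(n)
dffits_crit = 2 * math.sqrt(k / n)       # 2 * sqrt(k / n)
print(round(mean_leverage, 3), round(dfbetas_crit, 3), round(dffits_crit, 3))

# October 1987: dffits = e* x sqrt(h / (1 - h)) with the rounded exhibit values.
e_star, h = -0.486, 0.156
dffits_1987_10 = e_star * math.sqrt(h / (1 - h))
print(round(dffits_1987_10, 3))
```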


Exercises:  T:  5.12a-c,  5.13;  E:  5.29c,  d,  5.30c,  d,  5.31e,  5.33a. 


5.6.3  Test  for  normality 

Skewness  and  kurtosis 

As  was  discussed  in  Chapter  4,  OLS  is  equivalent  to  maximum  likelihood  if 
the  error  terms  are  normally  distributed.  So  under  this  assumption  OLS  is  an 
optimal estimation method, in the sense that it is consistent and
(asymptotically) efficient. For this reason it is of interest to test Assumption 7 of normally
distributed  error  terms.  It  is  also  of  interest  for  other  reasons  —  for  example, 
because  many  econometric  tests  (like  the  t-test  and  the  F-test)  are  based  on 
the  assumption  of  normally  distributed  error  terms. 

Suppose  that  the  standard  Assumptions  1-6  of  the  regression  model  are 
satisfied.  This  means  that 


$$ y_i = x_i'\beta + \varepsilon_i, \qquad i = 1, \ldots, n, $$

where $E[\varepsilon_i] = 0$, $E[\varepsilon_i^2] = \sigma^2$, and $E[\varepsilon_i\varepsilon_j] = 0$ for all $i \neq j$. Then Assumption 7
of normally distributed disturbances can be tested by means of the OLS
residuals $e_i = y_i - x_i'b$. In particular, we can compare the sample moments
of the residuals with the theoretical moments of the disturbances under the
null hypothesis of the normal distribution. In this case there holds $E[\varepsilon_i^3] = 0$
and $E[\varepsilon_i^4] = 3\sigma^4$, so that the skewness (S) and kurtosis (K) are equal to

$$ S = E[\varepsilon_i^3]/\sigma^3 = 0, \qquad K = E[\varepsilon_i^4]/\sigma^4 = 3. $$




If the null hypothesis of normality is true, then the residuals $e_i$ should have a
skewness close to 0 and a kurtosis close to 3. We suppose that the model
contains a constant term, so that the sample mean of the residuals is zero.
Then the $j$th moment of the residuals is given by $m_j = \sum_{i=1}^{n} e_i^j/n$ and the
skewness and kurtosis are computed as

$$ S = m_3/(m_2)^{3/2}, \qquad K = m_4/m_2^2. $$

It can be shown that, under the null hypothesis of normality, $\sqrt{n/6}\,S$ and
$\sqrt{n/24}\,(K - 3)$ are asymptotically independently distributed as N(0, 1).
These results can be used to perform individual tests for the skewness and
kurtosis.


Jarque-Bera  test  on  normality 

The skewness and kurtosis can also be used jointly to test for normality. The
normal distribution has skewness S = 0 and kurtosis K = 3, and the
deviation from normality can be measured by

$$ JB = \left(\sqrt{n/6}\,S\right)^2 + \left(\sqrt{n/24}\,(K - 3)\right)^2
      = n\left(\frac{1}{6}S^2 + \frac{1}{24}(K - 3)^2\right) \approx \chi^2(2). $$

This is the Jarque-Bera test on normality, and the null hypothesis is rejected
for large values of JB. Here we will not derive the asymptotic $\chi^2(2)$
distribution, but note that the null hypothesis poses two conditions (S = 0 and
K = 3), so that the test statistic has two degrees of freedom.
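A minimal sketch of the Jarque-Bera computation is given below. It assumes residuals with sample mean zero (a model with constant term) and uses the fact that the survival function of the chi-squared distribution with two degrees of freedom is exp(-x/2); the function name and the simulated samples are illustrative.

```python
import numpy as np

def jarque_bera(e):
    """Skewness, kurtosis, JB statistic, and asymptotic chi-squared(2) P-value."""
    e = np.asarray(e, dtype=float)
    n = e.size
    m2, m3, m4 = (np.mean(e**p) for p in (2, 3, 4))   # moments about zero
    S = m3 / m2**1.5
    K = m4 / m2**2
    jb = (np.sqrt(n / 6) * S) ** 2 + (np.sqrt(n / 24) * (K - 3)) ** 2
    p_value = np.exp(-jb / 2)                         # P(chi2(2) > jb)
    return S, K, jb, p_value

# Hypothetical samples: normal versus fat-tailed t(3) data, both demeaned.
rng = np.random.default_rng(6)
normal = rng.standard_normal(240)
fat = rng.standard_t(3, size=240)
print(jarque_bera(normal - normal.mean()))
print(jarque_bera(fat - fat.mean()))
```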


Example  5.28:  Stock  Market  Returns  (continued) 

We continue our analysis of Example 5.27 in the previous section
and consider the Capital Asset Pricing Model (CAPM) of Example 2.5
(p. 91) for the sector of cyclical consumer goods. The data consist of monthly
observations over the period 1980-99 (n = 240). Exhibit 5.40 (a, b) shows
the time series plot and the histogram of the residuals. The skewness and
kurtosis are equal to S = -0.28 and K = 4.04. This gives values of
$\sqrt{n/6}\,S = -1.77$ and $\sqrt{n/24}\,(K - 3) = 3.30$. The corresponding
(two-sided) P-value for the hypothesis that S = 0 is P = 0.08, and for the
hypothesis that K = 3 it is P = 0.001. So the residuals have a considerably larger
kurtosis than the normal distribution. The Jarque-Bera test has value
$JB = (-1.77)^2 + (3.30)^2 = 14.06$ with P-value 0.001. So the assumption
of normality is rejected. Exhibit 5.40 (c) shows the histogram that results
when two extremely large negative residuals (in the months of September
1980 and September 1982) are removed. This has a large effect on the
skewness and kurtosis, and the assumption of normality is no longer rejected.




Exhibit  5.40  Stock  Market  Returns  (Example  5.28) 


Time plot of residuals of CAPM (a) and histograms of all residuals (n = 240 (b)) and of
residuals with two outliers removed (n = 238 (c)).


normality  is  a  reasonable  one.  The  two  extreme  observations  were  detected 
as  outliers  in  Example  5.27. 


Exercises: E: 5.28d, 5.31e.


5.6.4  Robust  estimation 

Motivation  of  robust  methods 

If we apply OLS, then all the observations are weighted in a similar way. On
the other hand, the regression model (5.55) with the dummy variable $D_j$
effectively removes all effects of the $j$th observation on the estimate of $\beta$. If this
observation is very influential but not very reliable, it may indeed make sense
to remove it. Sometimes there are several or even a large number of outlying
data points, and it may be undesirable to neglect them completely.

An alternative is to use another estimation criterion that assigns relatively
less weight to extreme observations as compared to OLS. Such estimation
methods are called robust, because the estimation results are relatively
insensitive to changes in the data.

As a simple illustration, we first consider the situation where the data
consist of a random sample $y_i$, $i = 1, \ldots, n$, and we want to estimate the
centre of location of the population. This centre is more robustly estimated
by the sample median than by the sample mean, as the following simulation
example illustrates.


Example  5.29:  Simulated  Data  of  Normal  and  Student  t(2) 
Distributions 

To illustrate the idea of robust estimation we consider two data generating
processes. The first one is the standard normal distribution N(0, 1). In this
case the sample mean is an efficient estimator, and the median is inefficient.
The second one is the Student t-distribution with two degrees of freedom,
t(2). This distribution has mean zero and infinite variance. It has very fat
tails so that outliers occur frequently, and the mean is an inefficient
estimator. Exhibit 5.41 shows summary statistics of simulated data from the
two distributions. The sample sizes are n = 10, 25, 100, and 400, with
1000 replications for each sample size. For every replication the mean and
median of the sample are computed as estimates of the centre of location,
$\mu = 0$. The exhibit reports the range (the difference between the maximum
and minimum values of these estimates over the 1000 replications) and the
(non-centred) sample standard deviation $\sqrt{\sum \hat{\mu}^2/1000}$ over the replications.
It clearly shows that the mean is the best estimator if the population


n      DGP          N(0,1)    N(0,1)    t(2)      t(2)
       Estimator    Mean      Median    Mean      Median

10     St. Dev.     0.322     0.383     1.015     0.445
       Range        2.092     3.145     18.903    3.457
25     St. Dev.     0.200     0.255     0.880     0.290
       Range        1.254     1.630     20.599    1.925
100    St. Dev.     0.098     0.126     0.325     0.135
       Range        0.617     0.795     4.720     0.984
400    St. Dev.     0.050     0.063     0.203     0.070
       Range        0.322     0.383     4.542     0.440

Exhibit 5.41 Simulated Data of Normal and Student t(2) Distribution (Example 5.29)

Sample standard deviation and range of sample mean and sample median over 1000
simulation runs of two DGPs (N(0, 1) and t(2)) for different sample sizes (n = 10, 25,
100, and 400).

is normally distributed and that the median is best if the population has the
t(2) distribution. That is, for distributions with fat tails the median is a more
robust estimator than the mean.
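A simulation along the lines of Example 5.29 can be sketched as follows. The seed and random number generator are arbitrary choices, so the numbers will not reproduce Exhibit 5.41 exactly; only the qualitative pattern should be comparable.

```python
import numpy as np

rng = np.random.default_rng(0)
reps = 1000

def rms(estimates):
    """Non-centred standard deviation of the estimates around the true centre 0."""
    return np.sqrt(np.mean(estimates**2))

results = {}
for n in (10, 25, 100, 400):
    samples = {
        "N(0,1)": rng.standard_normal((reps, n)),
        "t(2)": rng.standard_t(2, size=(reps, n)),
    }
    for name, data in samples.items():
        means = data.mean(axis=1)            # sample mean per replication
        medians = np.median(data, axis=1)    # sample median per replication
        results[(n, name)] = {
            "sd_mean": rms(means),
            "sd_median": rms(medians),
            "range_mean": np.ptp(means),
            "range_median": np.ptp(medians),
        }

for key, stats in results.items():
    print(key, {name: round(float(value), 3) for name, value in stats.items()})
```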


Robust  estimation  criteria 

Now we consider the model $y_i = x_i'\beta + \varepsilon_i$ and we suppose that Assumptions 1-6
are satisfied. Further we suppose that $\beta$ is estimated by minimizing a criterion
function of the form

$$ S(\beta) = \sum_{i=1}^{n} G(y_i - x_i'\beta) = \sum_{i=1}^{n} G(e_i), \qquad (5.59) $$

where we write $e_i = y_i - x_i'\beta$ for the residuals. The function $G$ is assumed to be
differentiable with derivative $g(e_i) = dG(e_i)/de_i$. The first order conditions for a
minimum of (5.59) are given by

$$ \frac{\partial S(\beta)}{\partial \beta} = -\sum_{i=1}^{n} g(e_i)x_i = 0. \qquad (5.60) $$

If one defines the weights $w_i = g(e_i)/e_i$, this can also be written as

$$ \sum_{i=1}^{n} w_i e_i x_i = 0. \qquad (5.61) $$

Ordinary least squares corresponds to the choice

$$ G(e_i) = \tfrac{1}{2}e_i^2, \qquad g(e_i) = e_i, \qquad w_i = 1. $$

The function $g(e_i)$ measures the influence of outliers in the first order conditions
(5.60) for the estimator $\hat{\beta}$. In ordinary least squares this influence is a linear
function of the residuals. A more robust estimator, that is, an estimator that is
less sensitive to outliers, is obtained by choosing

$$ G(e_i) = |e_i|, \qquad g(e_i) = \begin{cases} -1 & \text{for } e_i < 0, \\ +1 & \text{for } e_i > 0. \end{cases} $$

We call this criterion function the least absolute deviation (LAD). If the
observations consist of a random sample, that is, $y_i = \mu + \varepsilon_i$ for $i = 1, \ldots, n$, then OLS
gives $\hat{\mu} = \sum y_i/n$ and LAD gives $\hat{\mu} = \text{med}(y_i)$, the median of the observations (see
Exercise 5.14). As our simulation in Example 5.29 illustrates, LAD is more robust
than OLS, but some efficiency is lost if the disturbances are normally distributed.
The attractive properties of both methods (OLS and LAD) can be combined by
using, for instance, the following criterion:

$$ G(e_i) = \begin{cases} \tfrac{1}{2}e_i^2 & \text{if } |e_i| \le c, \\ c|e_i| - \tfrac{1}{2}c^2 & \text{if } |e_i| > c. \end{cases} \qquad (5.62) $$

This criterion was proposed by Huber. The derivative of $G$ is given by

$$ g(e_i) = \begin{cases} -c & \text{if } e_i < -c, \\ e_i & \text{if } -c \le e_i \le c, \\ c & \text{if } e_i > c. \end{cases} \qquad (5.63) $$

The corresponding estimator of $\beta$ gives a compromise between the efficiency (for
normally distributed errors) of OLS (obtained for $c \to \infty$) and the robustness of
LAD (obtained for $c \downarrow 0$). In Exhibit 5.42 the Huber criterion is compared with
OLS and LAD. Relatively small residuals have a linear influence and constant
weights, and large residuals have constant influence and declining weights. The
influence of outliers is reduced because (5.63) imposes a threshold on the function
$g(e_i)$.
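The three functions of the Huber criterion, namely G in (5.62), its derivative g in (5.63), and the weights w = g(e)/e, can be written down directly. The function names below are illustrative, and the weight at e = 0 is set to 1 by continuity.

```python
import numpy as np

def huber_G(e, c):
    """Criterion (5.62): quadratic for |e| <= c, linear beyond the threshold."""
    a = np.abs(e)
    return np.where(a <= c, 0.5 * a**2, c * a - 0.5 * c**2)

def huber_g(e, c):
    """Influence function (5.63): the residual, clipped to the interval [-c, c]."""
    return np.clip(e, -c, c)

def huber_w(e, c):
    """Weights g(e)/e: equal to 1 for |e| <= c and to c/|e| for |e| > c."""
    return c / np.maximum(np.abs(e), c)

e = np.array([-3.0, 0.5, 3.0])
print(huber_G(e, 1.0), huber_g(e, 1.0), huber_w(e, 1.0))
```

For large c all residuals fall in the quadratic region and the functions reduce to those of OLS.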


Exhibit  5.42  Three  estimation  criteria 


Criterion functions ($G$ in (a-c)), first order derivatives (influence functions $g$ in (d-f)), and
weights ($w_i$ in $\sum w_i e_i x_i = 0$ in (g-i)) of three criteria, OLS ((a), (d), (g)), LAD ((b), (e), (h)),
and Huber ((c), (f), (i)).


Remarks on statistical properties

In general, the equations (5.60) to compute the estimator $\hat{\beta}$ are non-linear and
should be solved by numerical methods. The initial estimate is of importance
and it is advisable to use a robust initial estimate, even if this may be
inefficient. The statistical properties of this estimator can be derived by noting
that (5.60) corresponds to a GMM estimator with moment functions
$g_i = -g(e_i)x_i$ (see Section 4.4.2 (p. 253-4)). The estimator is consistent and,
in large enough samples, approximate standard errors can be obtained from
the asymptotic results on GMM in Section 4.4.3 (p. 258), provided that
$E[g(\varepsilon_i)x_i] = 0$, see (4.61) (p. 253). If the regressors $x_i$ are not stochastic this
gives the condition

$$ E[g(\varepsilon_i)] = 0. $$

For OLS, with $g(\varepsilon_i) = \varepsilon_i$, this is guaranteed by Assumption 2, as this states that
$E[\varepsilon_i] = 0$. For LAD this condition means that $P(\varepsilon_i > 0) = P(\varepsilon_i < 0)$, that is, that
the median of the distribution of $\varepsilon_i$ is zero.


Interpretation  of  robust  estimation  in  terms  of  ML 

Robust estimation can also be interpreted in terms of maximum likelihood
estimation by an appropriate choice of the probability distribution of the error
terms $\varepsilon_i$. Let $\varepsilon_i$ have density function $f$ and let $l_i = \log(f(e_i))$, where $e_i = y_i - x_i'\hat{\beta}$
for a given estimate $\hat{\beta}$ of $\beta$. Then the log-likelihood is given by $\log L =
\sum l_i = \sum \log(f(e_i))$, and ML corresponds to the minimization of $(-\log L)$ with
first order conditions

$$
\frac{\partial \log L}{\partial \hat{\beta}} = \sum_{i=1}^{n} \frac{\partial \log(f(e_i))}{\partial \hat{\beta}} = \sum_{i=1}^{n} \frac{d \log(f(e_i))}{d e_i} \frac{\partial e_i}{\partial \hat{\beta}} = -\sum_{i=1}^{n} \frac{f'(e_i)}{f(e_i)}\, x_i = 0,
$$

where $f'(e_i) = df(e_i)/de_i$ is the derivative of $f$. This corresponds to the equations
(5.60) with

$$
g(e_i) = -f'(e_i)/f(e_i).
$$

In practice the density $f$ of the disturbances is unknown. A criterion of the
type (5.59) can be interpreted as postulating that $-f'(e_i)/f(e_i) = g(e_i)$ is a reasonable
assumption to estimate $\beta$. For OLS this leads to $-f'(e_i)/f(e_i) = e_i$, with
solution $f(e_i)$ the standard normal distribution. For LAD this gives
$-f'(e_i)/f(e_i) = \pm 1$ (the sign of $e_i$), with solution $f(e_i) = \tfrac{1}{2} e^{-|e_i|}$ (see also Exercise
5.14).
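
As a quick check of the relation $g = -f'/f$, the two cases just mentioned can be verified directly. This is a worked sketch using the usual normalizations of the standard normal and double exponential densities; for the latter the derivative is taken for $e \neq 0$.

```latex
% OLS: standard normal density
f(e) = \frac{1}{\sqrt{2\pi}}\, e^{-e^2/2}
\;\Longrightarrow\; f'(e) = -e\, f(e)
\;\Longrightarrow\; -\frac{f'(e)}{f(e)} = e .

% LAD: double exponential density, for e \neq 0
f(e) = \tfrac{1}{2}\, e^{-|e|}
\;\Longrightarrow\; f'(e) = -\operatorname{sign}(e)\, f(e)
\;\Longrightarrow\; -\frac{f'(e)}{f(e)} = \operatorname{sign}(e) = \pm 1 .
```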
Interpretation of robust estimation in terms of WLS

The estimator $\hat{\beta}$ that minimizes (5.59) can also be interpreted in terms of weighted
least squares. If the weights $w_i$ in (5.61) are fixed, then these equations correspond
to the first order conditions for minimizing the weighted least squares criterion
$\sum w_i e_i^2$. So the weights $w_i$ measure the relative importance of the squared
errors $e_i^2$ in determining $\hat{\beta}$. This also motivates a simple iterative method for
estimating $\beta$ by means of the (robust) criterion (5.59). Start with weights
$w_i = 1$, $i = 1, \ldots, n$, and estimate $\beta$ by OLS with residuals $e_i$. Then compute
$w_i = g(e_i)/e_i$ and estimate $\beta$ by WLS. Iterate the computation of residuals $e_i$, weights
$w_i$, and WLS estimates of $\beta$, until convergence.
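
The iteration described above can be sketched as follows. This is a minimal illustration rather than a definitive implementation: it assumes NumPy, the Huber function $g$ of (5.63) with a fixed threshold $c > 0$ applied to unscaled residuals, and hypothetical names (`irls_huber`, `max_iter`, `tol`).

```python
import numpy as np

def irls_huber(X, y, c, max_iter=200, tol=1e-8):
    """Iteratively reweighted least squares with Huber weights w = g(e)/e."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.ones(len(y))          # start with unit weights, so the first step is OLS
    beta = None
    for _ in range(max_iter):
        # WLS step: OLS on the data scaled by the square roots of the weights
        sw = np.sqrt(w)
        beta_new = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        converged = beta is not None and np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        # Update residuals and weights: w = 1 if |e| <= c, and c/|e| otherwise
        e = y - X @ beta
        w = c / np.maximum(np.abs(e), c)
        if converged:
            break
    return beta, w

if __name__ == "__main__":
    # Location model with one outlier: the sample mean is 2, but the Huber
    # estimate stays close to the bulk of the data.
    y = np.array([0.0, 0.0, 0.0, 0.0, 10.0])
    X = np.ones((5, 1))
    beta, w = irls_huber(X, y, c=1.0)
    print(beta, w)
```

In this small example the first step returns the sample mean, after which the outlying observation is downweighted and the estimate moves towards the four remaining observations.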


Appropriate  scaling  in  robust  estimation 

It is of course preferable that the results do not depend on the chosen scales of
measurement of the variables. Let us consider the effect of rescaling the dependent
variable $y_i$. If this variable is replaced by $y_i^* = a y_i$, with $a$ a given constant, then we
would like the estimates $\hat{\beta}$ to be replaced by $\hat{\beta}^* = a\hat{\beta}$, as in this case the fitted values
are related by $\hat{y}_i^* = x_i'\hat{\beta}^* = a x_i'\hat{\beta} = a\hat{y}_i$. The criteria OLS and LAD satisfy this
requirement. For other criterion functions this requirement is satisfied by replacing
(5.59) by

$$
S(\hat{\beta}) = \sum_{i=1}^{n} G\left( \frac{y_i - x_i'\hat{\beta}}{\sigma} \right),
$$

where $\sigma^2 = E[\varepsilon_i^2] = \mathrm{var}(y_i)$. For instance, for the Huber criterion (5.62) this means
that $c$ should be replaced by $c\sigma$.

In practice $\sigma$ is unknown and has to be estimated. The usual OLS estimator
of the variance may be sensitive to outliers. Let $m$ denote the median of
the $n$ residuals $e_1, \ldots, e_n$; then a robust estimator of the standard deviation is
given by

$$
\hat{\sigma} = 1.483 \cdot \mathrm{med}(|e_j - m|,\ j = 1, \ldots, n), \qquad (5.64)
$$

where 'med' denotes the median of the $n$ values $|e_j - m|$. It is left as an exercise (see
Exercise 5.12) to prove (for a simple case) that this gives a consistent estimator of
$\sigma$ if the observations are normally distributed.

This can be used to estimate the parameters $\beta$ and $\sigma$ by an iterative two-step
method. In the first step $\sigma$ is fixed and $\beta$ is estimated robustly, with corresponding
residuals $e_i$. In the second step, $\sigma$ is estimated from the residuals $e_i$. This new
estimate of $\sigma$ can be used to compute new robust estimates of $\beta$, and so on, until
the estimates converge.
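
The scale estimator (5.64) is easily computed. The sketch below assumes NumPy; the function name `robust_sigma` is hypothetical.

```python
import numpy as np

def robust_sigma(e):
    """Robust scale estimate (5.64): 1.483 times the median absolute
    deviation of the residuals from their median."""
    e = np.asarray(e, dtype=float)
    m = np.median(e)
    return 1.483 * np.median(np.abs(e - m))

if __name__ == "__main__":
    residuals = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
    # The single outlier inflates the conventional standard deviation,
    # but it hardly affects the median-based estimate.
    print(robust_sigma(residuals), np.std(residuals, ddof=1))
```

In the two-step method, this estimate would be recomputed from the residuals after each robust estimation of $\beta$.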
Limiting the influence of observations with large leverage

Finally we note that by an appropriate choice of the function $g$ in the criterion
(5.60) the influence of large residuals can be limited, but the explanatory variables
can still be influential because of the linear term $x_i$. This influence may also be
bounded, for example, by replacing the ‘normal equations’ (5.60) by


$$
\sum_{i=1}^{n} g_1(x_i)\, g_2(e_i) = 0.
$$


For instance, one can take $g_1(x_i) = g(x_i'(X'X)^{-1}x_i) = g(h_i)$ with $h_i$ the leverage in
(5.54) and with $g$ chosen as in (5.63) with $c = k/n$ (the mean value of the
leverages).


Exercises:  T:  5.12d,  5.14,  5.15;  E:  5.29e. 


5.6.5  Summary 

In least squares the deviations from the postulated relation between dependent
and independent variables are penalized in a quadratic way. This
means that observations that deviate much from the general pattern may
have an excessive influence on the parameter estimates. To investigate the
presence of such influential observations and to reduce their influence, one
can proceed as follows.

•  A first impression may be obtained by inspecting the histogram of the
least squares residuals and by the Jarque-Bera test on normality. Note,
however, that OLS is not a reliable method to detect influential observations.

•  Influential data may be detected by considering the leverages, studentized
residuals, dffits, and dfbetas of the individual observations.

•  If some of the observations deviate a lot from the overall pattern, one
should try to understand the possible causes. This may suggest, for
instance, additional relevant explanatory variables or another choice
for the distribution of the disturbances. In some cases it may also be that
some of the reported data are unreliable, so that they should be excluded
in estimation.

•  If the deviating observations are a realistic aspect of the data (as is the
case in many situations), one may wish to limit their influence by
applying a robust estimation method.

•  The choice of the robust estimation method (corresponding to solving
the equations $\sum_{i=1}^{n} g(e_i) x_i = 0$) can be based on ideas concerning appropriate
weights $w_i$ of the individual observations (by taking $g(e_i) = w_i e_i$)
or on ideas concerning the probability distribution $f$ of the disturbances
(by taking $g(e_i) = -f'(e_i)/f(e_i)$ where $f'(e_i) = df(e_i)/de_i$).
5.7 Endogenous regressors and instrumental variables


5.7.1 Instrumental variables and two-stage least squares

Motivation

Until now we have assumed either that the regressors $x_i$ are fixed or that
they are stochastic and exogenous in the sense that there is no correlation
between the regressors and the disturbance terms. It is intuitively clear that,
if $x_i$ and $\varepsilon_i$ are mutually correlated, it will be hard to distinguish their
individual contributions to the outcome of the dependent variable $y_i =
x_i'\beta + \varepsilon_i$. In Section 4.1.3 (p. 194-6) we showed that OLS is inconsistent in
this situation.

We briefly discuss two examples that will be treated in greater detail later
in this section. The first example is concerned with price movements on
financial markets. If we relate the returns of one financial asset $y$ (in our
example AAA bonds) to the returns of another asset $x$ (in our example
Treasury Bill notes) by means of the simple regression model

$$
y_i = \alpha + \beta x_i + \varepsilon_i,
$$

then $x_i$ and $\varepsilon_i$ may well be correlated. This is the case if the factors $\varepsilon_i$
that affect the bond rate, such as the general sentiment in the market, also
affect the Treasury Bill rate. For instance, unforeseen increased uncertainties
in international trade may have a simultaneous upward effect both on bond
rates and on interest rates. We will consider this possible endogeneity of
Treasury Bill rates in later examples in this section (see Examples 5.30, 5.32,
and 5.33).

As a second example, for many goods the price and traded quantity are
determined jointly in the market. A higher price may lead to lower demand,
whereas a higher demand may lead to higher prices. If we relate price $x$ and
quantity $y$ by the simple regression model $y_i = \alpha + \beta x_i + \varepsilon_i$, then $x_i$ and $\varepsilon_i$
may well be correlated. For instance, the demand may increase because of
higher wealth of consumers, and this may at the same time increase the price.
We will consider this possible endogeneity of the price by considering the
market for motor gasoline consumption in later examples (see Examples 5.31
and 5.34).

OLS  requires  exogenous  regressors 

In the multivariate regression model $y_i = x_i'\beta + \varepsilon_i$, the dependent variable $y_i$
is modelled in terms of $k$ explanatory variables $x_i' = (1, x_{2i}, \ldots, x_{ki})$. Under
the standard Assumptions 1-6 of Section 3.1.4 (p. 125), it is assumed that $y_i$
is a random variable, but that the values of $x_i$ are ‘fixed’. In many situations
the outcomes of the variables $x_i$ are partly random. This was analysed in
Section 4.1 under Assumption 1* of stability, that is,

$$
\mathrm{plim}\left(\tfrac{1}{n} X'X\right) = Q \qquad (5.65)
$$

exists with $Q$ a $k \times k$ invertible matrix. In Section 4.1.3 (p. 194) we derived
that OLS is consistent in this case if and only if the explanatory variables are
(weakly) exogenous, that is, the variables should satisfy the orthogonality
condition $\mathrm{plim}\left(\tfrac{1}{n} X'\varepsilon\right) = 0$. In this case the results that were obtained under
the assumption of fixed regressors (including the diagnostic analysis in
Sections 5.1-5.6) carry over to the case of stochastic exogenous regressors,
by interpreting the results conditional on the given outcomes of the
regressors in the $n \times k$ matrix $X$. For instance, the statistical properties
$E[b] = \beta$ and $\mathrm{var}(b) = \sigma^2 (X'X)^{-1}$ should then be interpreted as $E[b|X] = \beta$
and $\mathrm{var}(b|X) = \sigma^2 (X'X)^{-1}$. This was also discussed in Section 4.1.2
(p. 191-2).

Consequences  of  endogenous  regressors 

We will now consider the situation where one or more of the regressors is
endogenous in the sense that

$$
\mathrm{plim}\left(\tfrac{1}{n} X'\varepsilon\right) \neq 0. \qquad (5.66)
$$

This means that the random variation in $X$ is correlated with the random
variation $\varepsilon$ in $y$. In such a situation it is difficult to isolate the effect of $X$ on $y$
because variations in $X$ are related to variations in $y$ in two ways, directly via
the term $X\beta$ but also indirectly via changes in the term $\varepsilon$. For instance, in a
cross section of cities the per capita crime ($y$) may very well be positively
correlated with the per capita police force ($x$), in which case a regression of $y$
on $x$ gives a positive OLS estimate of the effect of police on crime. The reason
is that in the model $y_i = \alpha + \beta x_i + \varepsilon_i$ cities with high crime rates ($\varepsilon_i > 0$) tend
to have larger police forces (values of $x_i$ larger than average). Clearly, in such
a situation the effect of police on crime cannot be estimated reliably by OLS
(see also Exercise 5.23). Stated in statistical terms, if one or more of the
regressors is endogenous, then OLS is no longer consistent and the conventional
results (t-tests, F-tests, diagnostic tests in previous sections of this
chapter) are no longer valid.
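
The inconsistency can be made explicit with the usual decomposition of the OLS estimator. The following is a sketch that uses the stability condition (5.65); the simple-regression form in the last line is added here for illustration and assumes that the covariance and variance shown exist.

```latex
b = (X'X)^{-1}X'y
  = \beta + \left(\tfrac{1}{n}X'X\right)^{-1} \tfrac{1}{n}X'\varepsilon
\quad\Longrightarrow\quad
\operatorname{plim} b
  = \beta + Q^{-1}\operatorname{plim}\left(\tfrac{1}{n}X'\varepsilon\right)
  \neq \beta .

% simple regression y_i = \alpha + \beta x_i + \varepsilon_i (slope estimator b)
\operatorname{plim} b
  = \beta + \frac{\operatorname{cov}(x_i,\varepsilon_i)}{\operatorname{var}(x_i)} .
```

In the police example the covariance between $x_i$ and $\varepsilon_i$ is positive, so that the OLS slope tends to lie above the true effect $\beta$.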

The  use  of  instruments 

A consistent estimator can be obtained if one can identify instruments. A set
of $m$ observed variables $z_i' = (z_{1i}, \ldots, z_{mi})$ is called a set of instruments if the
following three conditions are satisfied, where $Z$ denotes the $n \times m$ matrix
with rows $z_i'$, $i = 1, \ldots, n$:

$$
\mathrm{plim}\left(\tfrac{1}{n} Z'\varepsilon\right) = 0, \qquad (5.67)
$$

$$
\mathrm{plim}\left(\tfrac{1}{n} Z'X\right) = Q_{zx}, \quad \mathrm{rank}(Q_{zx}) = k, \qquad (5.68)
$$

$$
\mathrm{plim}\left(\tfrac{1}{n} Z'Z\right) = Q_{zz}, \quad \mathrm{rank}(Q_{zz}) = m. \qquad (5.69)
$$

The condition (5.67) means that the instruments should be exogenous. For
example, this is satisfied (under weak additional conditions) when the instruments
are uncorrelated with the disturbances in the sense that

$$
E[z_i \varepsilon_i] = 0, \quad i = 1, \ldots, n. \qquad (5.70)
$$

The condition (5.68) means that the instruments should be sufficiently correlated
with the regressors. This is called the rank condition. As $Q_{zx}$ is an
$m \times k$ matrix, this requires that $m \ge k$, that is, the number of instruments
should be at least as large as the number of regressors. This is called the order
condition for the instruments. The stability condition (5.69) is similar to
(5.65).

How  to  find  instruments? 

Before we describe the instrumental variable estimator (below) and its statistical
properties (in the next section), we first discuss the question of how to
find instruments. First of all, one should analyse which of the explanatory
variables are endogenous. If the $j$th explanatory variable is exogenous,
so that $\mathrm{plim}\left(\tfrac{1}{n}\sum_{i=1}^{n} x_{ji}\varepsilon_i\right) = 0$, then this variable should be included in the
set of instruments. For instance, the constant term should always be
included, together with all other exogenous regressors. If $k_0$ of the regressors
are endogenous, one should find at least $k_0$ additional instruments. One
option is to formulate additional equations that explain the dependence of
the endogenous variables in terms of exogenous variables. This leads to
simultaneous equation models that are discussed in Chapter 7. In many
cases it is too demanding to specify such additional equations, and instead
one selects a number of variables that are supposed to satisfy the conditions
(5.67)-(5.69). In Section 5.7.3 we describe a test for the validity of these
conditions.

In practice the choice of instruments is often based on economic insight, as
we will illustrate by means of two examples at the end of this section.


Derivation  of  IV  estimator 

To describe the instrumental variable (IV) estimator, we assume that condition
(5.70) is satisfied. This corresponds to $m$ moment conditions. The IV estimator is
defined as the GMM estimator corresponding to these moment conditions. In the
exactly identified case ($m = k$), the IV estimator $b_{IV}$ is given by the solution of the
$m$ equations $\sum_{i=1}^{n} z_i (y_i - x_i' b_{IV}) = 0$, that is,

$$
b_{IV} = \left( \sum_{i=1}^{n} z_i x_i' \right)^{-1} \sum_{i=1}^{n} z_i y_i = (Z'X)^{-1} Z'y.
$$

In the over-identified case ($m > k$), the results in Section 4.4.3 (p. 256) show that
the efficient estimator corresponding to these moment conditions is obtained by
weighted least squares. More particularly, the GMM criterion function $\tfrac{1}{n} G_n' W G_n$
in (4.63) with $G_n = \sum_{i=1}^{n} z_i (y_i - x_i'\beta) = Z'(y - X\beta)$ leads to the criterion function

$$
S(\beta) = \tfrac{1}{n} (y - X\beta)' Z W Z' (y - X\beta),
$$

where the weighting matrix $W$ is equal to the inverse of the covariance matrix
$J = E[z_i \varepsilon_i (z_i \varepsilon_i)'] = E[\varepsilon_i^2 z_i z_i']$. Under weak regularity conditions, a consistent estimator
of these weights is given by $W = \hat{J}^{-1}$, where $\hat{J} = \tfrac{\sigma^2}{n} \sum_{i=1}^{n} z_i z_i' = \tfrac{\sigma^2}{n} Z'Z$. As the
scale factor $\sigma^2$ has no effect on the location of the minimum of $S(\beta)$, we obtain the
criterion function

$$
S_{IV}(\beta) = (y - X\beta)' Z (Z'Z)^{-1} Z' (y - X\beta) = (y - X\beta)' P_Z (y - X\beta), \qquad (5.71)
$$

where $P_Z = Z(Z'Z)^{-1}Z'$ is the projection matrix corresponding to regression on
the instruments $Z$. The first order conditions for a minimum are given by

$$
\frac{\partial S_{IV}(\beta)}{\partial \beta} = -2 X' P_Z (y - X\beta) = 0.
$$

The IV estimator and two-stage least squares

The foregoing analysis shows that the IV estimator is given by

$$
b_{IV} = (X' P_Z X)^{-1} X' P_Z y. \qquad (5.72)
$$

This estimator has an interesting interpretation. We are interested in the
coefficients $\beta$ in $y = X\beta + \varepsilon$, but OLS is inconsistent because of (5.66). If
the regressors had been $Z$ instead of $X$, then (5.67) means that OLS would be
consistent. The idea is to replace $X$ by linear combinations of $Z$ that approximate
$X$ as well as possible. This best approximation is obtained by regressing
every column of $X$ on the instruments matrix $Z$. The fitted values of this
regression are

$$
\hat{X} = Z(Z'Z)^{-1}Z'X = P_Z X. \qquad (5.73)
$$

Then $\beta$ is estimated by regressing $y$ on $\hat{X}$, which gives the following estimator
of $\beta$:

$$
(\hat{X}'\hat{X})^{-1}\hat{X}'y = (X' P_Z X)^{-1} X' P_Z y = b_{IV}. \qquad (5.74)
$$

So the IV estimator can be computed by two successive regressions. The IV
estimator is therefore also called the two-stage least squares estimator, abbreviated
as 2SLS.


Two-stage least squares estimates of the parameters $\beta$ (2SLS)

•  Stage 1. Regress each column of $X$ on $Z$, with fitted values $\hat{X} =
Z(Z'Z)^{-1}Z'X$.

•  Stage 2. Regress $y$ on $\hat{X}$, with parameter estimates $b_{IV} = (\hat{X}'\hat{X})^{-1}\hat{X}'y$.
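
The two stages can be sketched in a few lines of code. The following is a minimal illustration, assuming NumPy and a simulated data-generating process chosen here purely for illustration (true slope 2, one endogenous regressor, one additional instrument); the function names are hypothetical.

```python
import numpy as np

def ols(X, y):
    """Least squares coefficients of a regression of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def two_stage_least_squares(X, Z, y):
    """2SLS: regress the columns of X on Z, then regress y on the fitted values."""
    X_hat = Z @ ols(Z, X)        # stage 1: fitted values P_Z X
    return ols(X_hat, y), X_hat  # stage 2: regression of y on X_hat

# Simulated example (an assumption for illustration): u is a common shock that
# enters both the regressor x and the disturbance eps, so x is endogenous.
rng = np.random.default_rng(0)
n = 20000
z = rng.standard_normal(n)                  # instrument, independent of eps
u = rng.standard_normal(n)
x = 0.8 * z + u + rng.standard_normal(n)
eps = u + rng.standard_normal(n)
y = 1.0 + 2.0 * x + eps

X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z])        # constant term included as instrument

b_ols = ols(X, y)
b_iv, _ = two_stage_least_squares(X, Z, y)
print("OLS:", b_ols)
print("IV: ", b_iv)
```

In this design the OLS slope is pulled away from 2 by the positive correlation between `x` and `eps`, whereas the 2SLS slope should be close to 2. Because the example is exactly identified ($m = k$), the 2SLS estimate coincides with $(Z'X)^{-1}Z'y$.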
Example 5.30: Interest and Bond Rates (continued)

As an illustration we consider the interest and bond rate data introduced in
Example 5.11. We will discuss (i) the possible endogeneity of the explanatory
variable (the interest rate), (ii) a suggestion for possible instruments, and (iii)
the results of IV estimation with these instruments.

(i) Possible endogeneity of the interest rate

In foregoing sections we analysed the relation between monthly changes in
the AAA bond rate ($y_i$) and in the short-term interest rate (the three-month
Treasury Bill rate, $x_i$) by the model

$$
y_i = \alpha + \beta x_i + \varepsilon_i, \quad i = 1, \ldots, n.
$$

It may very well be that the factors $\varepsilon_i$ that cause changes in the AAA bond rate
reflect general financial conditions that also affect the Treasury Bill rate. If
this is the case, then $x_i$ is not exogenous and OLS is not consistent.


Panel 1: Correlogram of explanatory var. DUS3MT
Sample 1980.01-1999.12 (240 observations)

Lag      AC      Q-Stat    Prob
1      0.279     18.938    0.000
2     -0.185     27.259    0.000
3     -0.155     33.156    0.000
4     -0.102     35.713    0.000
5      0.037     36.056    0.000
6     -0.167     42.972    0.000
7     -0.157     49.110    0.000
8      0.155     55.157    0.000
9      0.264     72.725    0.000
10     0.048     73.318    0.000
11    -0.110     76.382    0.000
12    -0.247     91.897    0.000

Panel 2: Dependent Variable: DUS3MT
Method: Least Squares
Sample: 1980:01 1999:12
Included observations: 240

Variable       Coefficient   Std. Error   t-Statistic   Prob.
C               -0.026112     0.039009     -0.669400    0.5039
DUS3MT(-1)       0.358145     0.062307      5.748060    0.0000
DUS3MT(-2)      -0.282601     0.062266     -4.538625    0.0000
R-squared        0.151651

Panel 3: Dependent Variable: DAAA
Method: Least Squares
Sample: 1980:01 1999:12
Included observations: 240

Variable       Coefficient   Std. Error   t-Statistic   Prob.
C               -0.004558     0.015440     -0.295200    0.7681
DUS3MT           0.306453     0.023692     12.93503     0.0000
R-squared        0.412803

Panel 4: Dependent Variable: DAAA
Method: Instrumental Variables
Sample: 1980:01 1999:12
Included observations: 240
Instrument list: C DUS3MT(-1) DUS3MT(-2)

Variable       Coefficient   Std. Error   t-Statistic   Prob.
C               -0.008453     0.016572     -0.510085    0.6105
DUS3MT           0.169779     0.064952      2.613906    0.0095
R-squared        0.330694

Exhibit 5.43 Interest and Bond Rates (Example 5.30)

Correlations of explanatory variable (DUS3MT) with its lagged values (Panels 1 and 2) and
regression model estimated by OLS (Panel 3) and by IV (Panel 4).

(ii) Possible instruments

If financial markets are efficient, this means that all past information is
processed in the current prices. In this case the current value of $\varepsilon_i$ is uncorrelated
with the past values of both $y_{i-j}$ and $x_{i-j}$ for all $j \ge 1$. We will assume
that the disturbance term $\varepsilon_i$ is correlated with the current change $x_i$ in
the Treasury Bill rate, but not with past changes $x_{i-1}$, $x_{i-2}$, and so on.
Then these past changes can serve as instruments. In Example 5.33 we
will test the exogeneity condition (5.67), that is, the condition that
$E[x_{i-1}\varepsilon_i] = E[x_{i-2}\varepsilon_i] = 0$.

(iii) Results of IV estimation

We now analyse the interest and bond rate data over the period from January
1980 to December 1999 ($n = 240$). To check the rank condition (5.68),
Panel 1 of Exhibit 5.43 shows that the variable $x_i$ is correlated with its
past values. As instruments we take $x_{i-1}$ and $x_{i-2}$, the one- and two-month
lagged changes in the Treasury Bill rate. The regression of $x_i$ on $x_{i-1}$ and $x_{i-2}$
has an $R^2 = 0.15$ (see Panel 2 of Exhibit 5.43). The condition (5.68) is
satisfied, although the correlations are not so large. Panel 4 reports the
IV estimates with instruments $z_i' = (1, x_{i-1}, x_{i-2})$, and for comparison
Panel 3 reports the OLS estimates. The estimates of the slope parameter $\beta$
differ quite substantially. A further analysis is given in Example 5.33 at the
end of Section 5.7.3.

Example 5.31: Motor Gasoline Consumption

For  many  goods,  the  price  and  traded  quantities  are  determined  jointly  in  the 
market  process.  It  may  well  be  that  price  and  quantity  influence  each  other, 
with  higher  prices  leading  to  lower  demand  and  with  higher  demand  leading 
to  higher  prices.  We  will  analyse  this  for  the  market  of  motor  gasoline  in  the 
USA. We will discuss (i) the data and (ii) possible instruments and corresponding
IV estimates.

(i)  The  data 

We  consider  the  relation  between  gasoline  consumption,  gasoline  price,  and 
disposable  income  in  the  USA.  Yearly  data  on  these  variables  and  three  price 
indices  (of  public  transport,  new  cars,  and  used  cars)  are  available  over  the 
period 1970-99. Exhibit 5.44 (a-c) shows time plots of these three variables
(all in logarithms), and a scatter diagram (d) and a partial scatter diagram (e)
(after removing the influence of income) of consumption against price.

We are interested in the demand equation for motor gasoline, in particular,
in the effects of price and income on demand. We postulate the linear demand
function

$$
GC_i = \alpha + \beta\, PG_i + \gamma\, RI_i + \varepsilon_i, \quad i = 1, \ldots, 30,
$$

where GC stands for gasoline consumption, PG for the gasoline price index,
and RI for disposable income (all measured in real terms and taken in
logarithms).

Exhibit 5.44 Motor Gasoline Consumption (Example 5.31)

Time plots of real gasoline consumption (GC (a)), real gasoline price (PG (b)) and real income
(RI (c)), all in logarithms, and scatter diagram of consumption against price (d) and partial
scatter diagram after removing the influence of income (e).

The USA is a major player on the world oil market, so that the fluctuations
$\varepsilon_i$ in US gasoline consumption could affect the gasoline price. If this
is the case, then PG is not exogenous, and OLS provides inconsistent
estimates.

(ii) Possible instruments and corresponding IV estimates

As possible instruments we consider (apart from the constant term and the
regressor RI) the real price indices of public transport (RPT), of new cars
(RPN), and of used cars (RPU). In Example 5.34 we will test whether these
variables are indeed exogenous.

Exhibit 5.45 shows the results of OLS (in Panel 1) and IV (in Panel 2). The
estimates do not differ much, which can be taken as an indication that the
gasoline price can be considered as an exogenous variable for gasoline
consumption in the US. In Example 5.34 we will formally test whether the
price is exogenous or endogenous.

Panel 1: Dependent Variable: GC
Method: Least Squares
Sample: 1970 1999
Included observations: 30

Variable   Coefficient   Std. Error   t-Statistic   Prob.
C            4.985997     0.081101     61.47914     0.0000
PG          -0.527578     0.026319    -20.04565     0.0000
RI           0.573220     0.024511     23.38644     0.0000
R-squared    0.987155

Panel 2: Dependent Variable: GC
Method: Instrumental Variables
Sample: 1970 1999
Included observations: 30
Instrument list: C RPT RPN RPU RI

Variable   Coefficient   Std. Error   t-Statistic   Prob.
C            5.013700     0.083911     59.75035     0.0000
PG          -0.544450     0.028950    -18.80669     0.0000
RI           0.564662     0.025389     22.24005     0.0000
R-squared    0.986959

Panel 3: Dependent Variable: PG
Method: Least Squares
Sample: 1970 1999
Included observations: 30

Variable   Coefficient   Std. Error   t-Statistic   Prob.
C            7.740963     0.833698      9.285095    0.0000
RPT         -0.808004     0.191221     -4.225499    0.0003
RPN         -3.527853     0.351973    -10.02308     0.0000
RPU          0.233078     0.183108      1.272898    0.2148
RI          -2.298421     0.247071     -9.302668    0.0000
R-squared    0.886815

Exhibit 5.45 Motor Gasoline Consumption (Example 5.31)

OLS of gasoline consumption (GC) on price of gasoline (PG) and income (RI) (Panel 1), IV of
consumption on price and income using five instruments, the constant term, income, and three
real price indices (of public transport (RPT), new cars (RPN), and used cars (RPU)) (Panel 2),
and relation between gasoline price and the five instruments (Panel 3).


Exercises:  T:  5.16a,  b,  5.18;  S:  5.23a-d. 


5.7.2  Statistical  properties  of  IV  estimators 


Derivation  of  consistency  of  IV  estimators 

We consider the properties of the IV estimator (5.72) for the model $y = X\beta + \varepsilon$
with $n \times m$ instrument matrix $Z$. Referring to Section 3.1.4 (p. 125-6), we
suppose that Assumptions 2-6 are satisfied and that Assumption 1 is replaced
by the five (asymptotic) conditions (5.65)-(5.69). Under these conditions the IV
estimator is consistent. To prove this, we write (5.72) as

$$
\begin{aligned}
b_{IV} &= \left( X'Z(Z'Z)^{-1}Z'X \right)^{-1} X'Z(Z'Z)^{-1}Z'(X\beta + \varepsilon) \\
&= \beta + \left( \tfrac{1}{n}X'Z \left( \tfrac{1}{n}Z'Z \right)^{-1} \tfrac{1}{n}Z'X \right)^{-1} \tfrac{1}{n}X'Z \left( \tfrac{1}{n}Z'Z \right)^{-1} \tfrac{1}{n}Z'\varepsilon.
\end{aligned} \qquad (5.75)
$$

Because of the conditions (5.67)-(5.69) we obtain the probability limit of $b_{IV}$ as

$$
\mathrm{plim}(b_{IV}) = \beta + (Q_{zx}' Q_{zz}^{-1} Q_{zx})^{-1} Q_{zx}' Q_{zz}^{-1} \cdot 0 = \beta.
$$

This shows that the exogeneity of the instruments $Z$ is crucial to obtain consistency.
Note that the IV estimator is also consistent if Assumptions 3 and 4 are not
satisfied (that is, for heteroskedastic or serially correlated errors), as long as the
instruments are exogenous. However, Assumptions 3 and 4 are needed in our
derivation of the asymptotic distribution of $b_{IV}$.


Derivation  of  asymptotic  distribution 

We will assume (in analogy with (4.6) in Section 4.1.4 (p. 196)) that

$$
\tfrac{1}{\sqrt{n}} Z'\varepsilon \xrightarrow{d} N(0, \sigma^2 Q_{zz}).
$$

Using the notation $b_{IV} = \beta + A_n \left( \tfrac{1}{n} Z'\varepsilon \right)$ for the last expression in (5.75),
we can rewrite (5.75) as $\sqrt{n}(b_{IV} - \beta) = A_n \cdot \tfrac{1}{\sqrt{n}} Z'\varepsilon$, where $A_n$ has probability
limit $A = (Q_{zx}' Q_{zz}^{-1} Q_{zx})^{-1} Q_{zx}' Q_{zz}^{-1}$. Combining these results and using
$A Q_{zz} A' = (Q_{zx}' Q_{zz}^{-1} Q_{zx})^{-1}$ gives

$$
\sqrt{n}(b_{IV} - \beta) \xrightarrow{d} N\left(0, \sigma^2 (Q_{zx}' Q_{zz}^{-1} Q_{zx})^{-1}\right).
$$

In large enough finite samples, $b_{IV}$ is approximately normally distributed with
mean $\beta$ and covariance matrix $\tfrac{\sigma^2}{n} \left( \tfrac{1}{n}X'Z \left( \tfrac{1}{n}Z'Z \right)^{-1} \tfrac{1}{n}Z'X \right)^{-1} = \sigma^2 (X' P_Z X)^{-1}$. With
the notation (5.73) this gives

$$
b_{IV} \approx N\left(\beta, \sigma^2 (X' P_Z X)^{-1}\right) = N\left(\beta, \sigma^2 (\hat{X}'\hat{X})^{-1}\right). \qquad (5.76)
$$

The instrumental variable estimator is relatively more efficient if the instruments
$Z$ are more highly correlated with the explanatory variables. In practice, the
exogeneity condition (5.67) is often satisfied only for variables that are relatively
weakly correlated with the explanatory variables. Such weak instruments lead to
relatively large variances of the IV estimator.

To use the above results in testing we need a consistent estimator of the variance
$\sigma^2$. Let $e_{IV} = y - X b_{IV}$ be the IV residuals; then a consistent estimator is given by

$$
s_{IV}^2 = \frac{1}{n-k}\, e_{IV}' e_{IV} = \frac{1}{n-k} (y - X b_{IV})'(y - X b_{IV}). \qquad (5.77)
$$

If the IV estimator is computed as in (5.74), that is, by regressing $y$ on $\hat{X}$, then
the conventional OLS expression for the covariance matrix is not correct. This
would give $\hat{s}^2 (\hat{X}'\hat{X})^{-1}$ with $\hat{s}^2 = \tfrac{1}{n-k} (y - \hat{X} b_{IV})'(y - \hat{X} b_{IV})$, and this estimator of
$\sigma^2$ is not consistent (see Exercise 5.16).
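
The difference between the two variance estimators can be illustrated numerically. The sketch below assumes NumPy and a simulated design chosen purely for illustration, in which the disturbance variance equals 2; the function name and the simulated variables are hypothetical.

```python
import numpy as np

def iv_with_standard_errors(X, Z, y):
    """2SLS estimates with standard errors based on the IV residuals (5.77)."""
    n, k = X.shape
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]   # first stage: P_Z X
    b_iv = np.linalg.lstsq(X_hat, y, rcond=None)[0]    # second stage
    e_iv = y - X @ b_iv              # IV residuals use X, not the fitted X_hat
    s2_iv = e_iv @ e_iv / (n - k)    # consistent estimator of sigma^2
    e_second = y - X_hat @ b_iv      # residuals of the second-stage regression
    s2_naive = e_second @ e_second / (n - k)           # not consistent
    cov = s2_iv * np.linalg.inv(X_hat.T @ X_hat)       # covariance as in (5.76)
    return b_iv, np.sqrt(np.diag(cov)), s2_iv, s2_naive

# Simulated data (an assumption for illustration): var(eps) = 2, true slope 2.
rng = np.random.default_rng(1)
n = 20000
z = rng.standard_normal(n)
u = rng.standard_normal(n)                 # shock shared by x and eps
x = 0.8 * z + u + rng.standard_normal(n)
eps = u + rng.standard_normal(n)
y = 1.0 + 2.0 * x + eps
X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z])

b_iv, se, s2_iv, s2_naive = iv_with_standard_errors(X, Z, y)
print("b_IV:", b_iv, "s.e.:", se)
print("s2 from IV residuals:", s2_iv, " s2 from second-stage residuals:", s2_naive)
```

In this design the estimate based on the IV residuals should be close to the true variance 2, whereas the second-stage residuals also contain the first-stage errors in $x$ and give a very different value.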


Remark  on  finite  sample  statistical  properties 

The above analysis is based on asymptotic results. As concerns finite sample
properties, we mention that in finite samples the $p$th moments of $b_{IV}$ exist if and
only if $p < m - k + 1$. In the exactly identified case there holds $m = k$, so that the
finite sample probability distribution of $b_{IV}$ does not have a well-defined mean or
variance. The covariance matrix of $b_{IV}$ exists if and only if $m \ge k + 2$. This result
could suggest that it is always best to incorporate as many instruments as possible.
Adding instruments also leads to asymptotically smaller variances, provided that
all additional instruments are exogenous. However, if the additional instruments
are weak, then the finite sample distribution may very well deteriorate. In practice it
is often better to search for a sufficient number of good instruments than for a large
number of relatively weak instruments.


Derivation  of  the  F-test  in  IV  estimation 

Tests on the individual significance of coefficients can be performed by conventional t-tests based on (5.76) and (5.77). An F-test for joint linear restrictions can be performed along the lines of Section 3.4.1 (p. 161-2). To derive the expression for this test we use some results of matrix algebra (see Appendix A, Section A.6 (p. 737)). There it is proved that the $n \times n$ projection matrix $P_Z = Z(Z'Z)^{-1}Z'$ of rank $m$ can be written in terms of an $m \times n$ matrix $K$ as

$$P_Z = K'K, \quad \text{with } KK' = I_m,$$

where $I_m$ is the $m \times m$ identity matrix. Define the $m \times 1$ vector $y^* = Ky$ and the $m \times k$ matrix $X^* = KX$. The instrumental variable criterion (5.71) can then be written as

$$S_{IV}(\beta) = (y - X\beta)'K'K(y - X\beta) = (y^* - X^*\beta)'(y^* - X^*\beta).$$

If $y = X\beta + \varepsilon$ with $\varepsilon \sim N(0, \sigma^2I_n)$, then

$$y^* = X^*\beta + \varepsilon^*, \quad \varepsilon^* \sim N(0, \sigma^2KK') = N(0, \sigma^2I_m).$$

This shows that IV estimation in the model $y = X\beta + \varepsilon$ is equivalent to applying OLS in the transformed model $y^* = X^*\beta + \varepsilon^*$. Let the unrestricted IV estimator be denoted by $b_{IV}$ and the restricted IV estimator by $b_{RIV}$, with corresponding residuals $e^* = y^* - X^*b_{IV} = K(y - Xb_{IV}) = Ke_{IV}$ and $e_R^* = y^* - X^*b_{RIV} = K(y - Xb_{RIV}) = Ke_{RIV}$. If the $g$ restrictions of the null hypothesis hold true, then the results in Section 3.4.1 (p. 161-2) imply that

$$(e_R^{*\prime}e_R^* - e^{*\prime}e^*)/\sigma^2 = (e_{RIV}'K'Ke_{RIV} - e_{IV}'K'Ke_{IV})/\sigma^2 = (e_{RIV}'P_Ze_{RIV} - e_{IV}'P_Ze_{IV})/\sigma^2 \sim \chi^2(g).$$

If we replace $\sigma^2$ by the consistent estimator (5.77), then we get

$$F = \frac{(e_{RIV}'P_Ze_{RIV} - e_{IV}'P_Ze_{IV})/g}{e_{IV}'e_{IV}/(n-k)} \approx F(g, n-k).$$

This differs from the standard expression (3.50) for the F-test, as in the numerator the IV residuals are weighted with $P_Z$.


Computation  of  the  F-test 

It is computationally more convenient to perform the following regressions. First regress every column of $X$ on $Z$ with fitted values $\hat{X}$ as in (5.73). Then perform two regressions of $y$ on $\hat{X}$, one without restrictions (with residuals denoted by $e$) and one with the restrictions of the null hypothesis imposed (with residuals denoted by $e_R$). Then

$$F = \frac{(e_R'e_R - e'e)/g}{e_{IV}'e_{IV}/(n-k)} \approx F(g, n-k). \tag{5.78}$$

The proof that this leads to the same F-value as the foregoing expression is left as an exercise (see Exercise 5.16).
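
The equality of the two F-expressions can also be checked numerically. The sketch below is an illustration under stated assumptions rather than the proof asked for in Exercise 5.16: the simulated data, the seed, and the helper names `ols` and `project` are hypothetical, and the restriction tested is a zero constant term. It computes the F-value with $P_Z$-weighted IV residuals, the F-value (5.78) from two regressions on $\hat{X}$, and the square of the IV t-value of the constant.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400

# Simulated data (illustrative assumption); the true constant term is zero.
z1, z2, u = rng.normal(size=(3, n))
x = 0.8 * z1 - 0.5 * z2 + u + rng.normal(size=n)
y = 0.5 * x + 0.7 * u + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z1, z2])
k, g = X.shape[1], 1                     # g = number of restrictions


def ols(A, b):
    """OLS coefficients of the regression of b on the columns of A."""
    return np.linalg.lstsq(A, b, rcond=None)[0]


def project(v):
    """Return P_Z v without forming the n x n matrix P_Z."""
    return Z @ ols(Z, v)


X_hat = project(X)
b_iv = ols(X_hat, y)                     # unrestricted IV estimate
b_riv = ols(X_hat[:, [1]], y)            # restricted IV estimate (no constant)

e_iv = y - X @ b_iv                      # IV residuals
e_riv = y - X[:, [1]] @ b_riv            # restricted IV residuals
denom = e_iv @ e_iv / (n - k)            # s_IV^2 as in (5.77)

# Version 1: IV residuals weighted with P_Z in the numerator.
F_weighted = (e_riv @ project(e_riv) - e_iv @ project(e_iv)) / g / denom

# Version 2, as in (5.78): residuals of the two regressions of y on X_hat.
e = y - X_hat @ b_iv
e_r = y - X_hat[:, [1]] @ b_riv
F_two_step = (e_r @ e_r - e @ e) / g / denom

# IV t-value of the constant, based on (5.76) and (5.77).
t_alpha = b_iv[0] / np.sqrt(denom * np.linalg.inv(X_hat.T @ X_hat)[0, 0])

print(F_weighted, F_two_step, t_alpha ** 2)
```

With a single restriction the three printed numbers should coincide up to rounding error, in line with the relation $F = t^2$.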

Example  5.32:  Interest  and  Bond  Rates  (continued) 

We continue our previous analysis of the interest and bond rate data in Example 5.30. The model is

$$y_i = \alpha + \beta x_i + \varepsilon_i,$$

with $y_i$ the monthly AAA bond rate changes and $x_i$ the monthly Treasury Bill rate changes. As instruments we take again $z_i' = (1, x_{i-1}, x_{i-2})$. Now we test whether the AAA bond rate will on average remain the same if the Treasury Bill rate is fixed. This seems to be a natural assumption. So we test the null hypothesis that $\alpha = 0$. Panel 1 of Exhibit 5.46 shows the t-value obtained by IV, that is, $t_{IV}(\hat{\alpha}) = -0.510$. So the null hypothesis is not rejected. Note that if we compute the IV estimate by regressing $y$ on $\hat{X}$ as in (5.74), then the reported t-value becomes $-0.42$ (see Panel 3 of Exhibit 5.46), so this t-value is not correct. Exhibit 5.46 also contains the regressions needed for (5.78), with sums of squared residuals $e_R'e_R = 22.717$ (Panel 4), $e'e = 22.700$ (Panel 3), and $e_{IV}'e_{IV} = 15.491$ (Panel 1). So the F-test for $\alpha = 0$ becomes


Panel 1: Dependent Variable: DAAA
Method: Instrumental Variables
Sample: 1980:01 1999:12
Included observations: 240
Instrument list: C DUS3MT(-1) DUS3MT(-2)

Variable       Coefficient   Std. Error   t-Statistic   Prob.
C              -0.008453     0.016572     -0.510085     0.6105
DUS3MT          0.169779     0.064952      2.613906     0.0095

R-squared       0.330694     Sum squared resid   15.49061

Panel 2: Dependent Variable: DUS3MT
Method: Least Squares
Sample: 1980:01 1999:12
Included observations: 240

Variable       Coefficient   Std. Error   t-Statistic   Prob.
C              -0.026112     0.039009     -0.669400     0.5039
DUS3MT(-1)      0.358145     0.062307      5.748060     0.0000
DUS3MT(-2)     -0.282601     0.062266     -4.538625     0.0000

R-squared       0.151651     Sum squared resid   86.30464

Panel 3: Dependent Variable: DAAA
Method: Least Squares
Sample: 1980:01 1999:12
Included observations: 240

Variable       Coefficient   Std. Error   t-Statistic   Prob.
C              -0.008453     0.020060     -0.421374     0.6739
XHAT            0.169779     0.078626      2.159311     0.0318

R-squared       0.019214     Sum squared resid   22.69959

Panel 4: Dependent Variable: DAAA
Method: Least Squares
Sample: 1980:01 1999:12
Included observations: 240

Variable       Coefficient   Std. Error   t-Statistic   Prob.
XHAT            0.173480     0.078000      2.224107     0.0271

R-squared       0.018483     Sum squared resid   22.71653

Exhibit  5.46  Interest  and  Bond  Rates  (Example  5.32) 

Model for AAA bond rates estimated by IV (Panel 1), first step of 2SLS (Panel 2, construction of $\hat{X}$, denoted by XHAT, by regressing DUS3MT on three instruments, that is, the constant term and the 1 and 2 lagged values of DUS3MT), second step of 2SLS (Panel 3, regression of AAA bond rates on XHAT), and regression of AAA bond rates on XHAT in restricted model without constant term (Panel 4). The sum of squared residuals in Panels 1, 3, and 4 is used in the F-test for the significance of the constant term.

$$F = \frac{(22.717 - 22.700)/1}{15.491/(240 - 2)} = 0.261, \quad P = 0.611.$$

This is equal to the square of the IV t-value in Panel 1, $F = t_{IV}^2(\hat{\alpha})$, and neither test leads to rejection of the hypothesis that $\alpha = 0$.
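
The arithmetic of this F-test can be retraced from the unrounded entries of Exhibit 5.46. The short sketch below uses only numbers printed in the exhibit; the variable names are illustrative. It suggests that the small gap between 0.261 and the squared t-value of about 0.260 comes from rounding the sums of squared residuals to three decimals.

```python
# Sums of squared residuals as reported in Exhibit 5.46.
ssr_restricted = 22.71653     # Panel 4: DAAA on XHAT, no constant
ssr_unrestricted = 22.69959   # Panel 3: DAAA on constant and XHAT
ssr_iv = 15.49061             # Panel 1: IV residuals
n, k, g = 240, 2, 1

# F-statistic (5.78) with unrounded inputs.
f_stat = ((ssr_restricted - ssr_unrestricted) / g) / (ssr_iv / (n - k))

# Same statistic with the three-decimal values used in the text.
f_rounded = ((22.717 - 22.700) / g) / (15.491 / (n - k))

# Square of the IV t-value of the constant term (Panel 1).
t_iv = -0.510085
t_squared = t_iv ** 2

print(f"F (unrounded) = {f_stat:.4f}")
print(f"F (rounded inputs) = {f_rounded:.4f}")
print(f"t^2 = {t_squared:.4f}")
```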

Exercises:  T:  5.16c,  e. 


5.7.3  Tests  for  exogeneity  and  validity 
of  instruments 

Motivation  of  exogeneity  tests 

If some of the regressors are endogenous, then OLS is not consistent but IV is consistent. On the other hand, if the regressors are exogenous, then OLS is consistent and (under the usual assumptions) more efficient than IV in the sense that $\mathrm{var}(b_{IV}) \geq \mathrm{var}(b)$, because $(X'P_ZX)^{-1} \geq (X'X)^{-1}$ (see Exercise 5.16). So OLS will be preferred if the regressors are exogenous (or weakly endogenous in the sense that the correlations in (5.66) are small) and IV will be better if the regressors are (too strongly) endogenous. The choice between these two estimators can be based on a test for the exogeneity of the regressors. So we want to test the null hypothesis of exogeneity, that is,

$$\mathrm{plim}\left(\frac{1}{n}X'\varepsilon\right) = 0, \tag{5.79}$$

against the alternative of endogeneity (5.66) that $\mathrm{plim}\left(\frac{1}{n}X'\varepsilon\right) \neq 0$. If the assumption of exogeneity is not rejected, we can apply OLS; otherwise it may be better to use IV to prevent large biases due to the inconsistency of OLS for endogenous regressors.


Derivation  of  test  based  on  comparison  of  OLS  and  IV 

A simple idea is the following. If the regressors in $y = X\beta + \varepsilon$ are exogenous, then OLS and IV are both consistent and the respective estimators $b$ and $b_{IV}$ of $\beta$ should not differ very much (in large enough samples). This suggests basing the test on the difference $d = b_{IV} - b$. Using (5.74) and the fact that $\hat{X}'\hat{X} = X'P_ZX = \hat{X}'X$, we get

$$b = (X'X)^{-1}X'y = \beta + (X'X)^{-1}X'\varepsilon,$$
$$b_{IV} = (\hat{X}'\hat{X})^{-1}\hat{X}'y = \beta + (\hat{X}'\hat{X})^{-1}\hat{X}'\varepsilon.$$
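
So $d = b_{IV} - b = \left(\frac{1}{n}\hat{X}'\hat{X}\right)^{-1}\frac{1}{n}\hat{X}'\varepsilon - \left(\frac{1}{n}X'X\right)^{-1}\frac{1}{n}X'\varepsilon$. If the null hypothesis (5.79) holds true, then $E[d] \approx 0$ and

$$\mathrm{var}(d) = \mathrm{var}\left((\hat{X}'\hat{X})^{-1}\hat{X}'\varepsilon - (X'X)^{-1}X'\varepsilon\right)$$
$$\approx \sigma^2\left((\hat{X}'\hat{X})^{-1}\hat{X}' - (X'X)^{-1}X'\right)\left((\hat{X}'\hat{X})^{-1}\hat{X}' - (X'X)^{-1}X'\right)'$$
$$= \sigma^2\left((\hat{X}'\hat{X})^{-1} - (X'X)^{-1}\right) \approx \mathrm{var}(b_{IV}) - \mathrm{var}(b),$$

where we used that $\mathrm{var}(\varepsilon) = \sigma^2I$ and $\hat{X}'X = \hat{X}'\hat{X}$. Under the usual assumptions, $d$ is also asymptotically normally distributed so that (under the null hypothesis of exogeneity)

$$(b_{IV} - b)'\left(\mathrm{var}(b_{IV}) - \mathrm{var}(b)\right)^{-1}(b_{IV} - b) \approx \chi^2(k).$$

This test is easy to apply, as OLS of $y$ on $X$ gives $b$ and an estimate of $\mathrm{var}(b)$, and OLS of $y$ on $\hat{X}$ gives $b_{IV}$ and an estimate of $\mathrm{var}(b_{IV})$ (see (5.76), with $\sigma^2$ estimated as in (5.77)). However, in finite samples the estimated covariances may be such that $(\mathrm{var}(b_{IV}) - \mathrm{var}(b))$ is not positive semidefinite, in which case the variance of $d$ is very badly estimated and the test as computed above does not have a good interpretation.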


Derivation  of  exogeneity  test  of  Durbin,  Wu,  and  Hausman 

Usually exogeneity is tested in another way. We will now describe an exogeneity test associated with Durbin, Wu, and Hausman, commonly known as the Hausman test. This test corresponds to the Lagrange Multiplier test. The main idea is to reformulate the exogeneity condition (5.79) in terms of a parameter restriction. For this purpose we split the regressors into two parts, the $k_0$ variables that are possibly endogenous and the other $(k - k_0)$ variables that are exogenous (for instance, the constant term). We order the regressors so that the first $(k - k_0)$ ones are exogenous and the last $k_0$ ones are potentially endogenous. The null hypothesis of exogeneity of these regressors is formulated as

$$E[x_{ji}\varepsilon_i] = 0, \quad j = k - k_0 + 1, \ldots, k.$$

By assumption, the $m$ instruments $z_i$ satisfy the exogeneity condition (5.70) that $E[z_i\varepsilon_i] = 0$. Now we consider the auxiliary regression model explaining the $j$th regressor in terms of these $m$ instruments, that is,

$$x_{ji} = z_i'\gamma_j + v_{ji}, \quad i = 1, \ldots, n. \tag{5.80}$$

Here $\gamma_j$ is an $m \times 1$ vector of parameters and $v_{ji}$ are error terms. Because of (5.70) it follows that $E[x_{ji}\varepsilon_i] = E[v_{ji}\varepsilon_i]$, and the null hypothesis of exogeneity is equivalent to $E[v_{ji}\varepsilon_i] = 0$ for $j = k - k_0 + 1, \ldots, k$. Let $v_i$ be the $k_0 \times 1$ vector with components $v_{ji}$; then the condition is that $\varepsilon_i$ is uncorrelated with all components of $v_i$. If we assume that all error terms are normally distributed, the condition becomes more specific, in that $v_i$ and $\varepsilon_i$ are independent, that is, that in the conditional expectation $E[\varepsilon_i|v_i] = v_i'\alpha$ there holds $\alpha = 0$. Let $w_i = \varepsilon_i - E[\varepsilon_i|v_i]$; then $\varepsilon_i = v_i'\alpha + w_i$, where $w_i$ is independent of $v_i$, and the condition of exogeneity can be expressed as follows:

$$\varepsilon_i = v_i'\alpha + w_i, \quad H_0: \alpha = 0. \tag{5.81}$$

Substituting the results (5.81) and (5.80) in the original model $y_i = x_i'\beta + \varepsilon_i$ gives

$$y_i = \sum_{j=1}^{k}\beta_jx_{ji} + \sum_{j=k-k_0+1}^{k}\alpha_j(x_{ji} - z_i'\gamma_j) + w_i. \tag{5.82}$$

This is a non-linear regression model, as it involves products of the unknown parameters $\alpha_j$ and $\gamma_j$. Assuming a joint normal distribution for the error terms $w_i$ in (5.82) and $v_{ji}$ in (5.80), the LM-test for the hypothesis that $\alpha = 0$ can be derived along the lines of Section 4.3.6 (p. 238) in terms of the score vector and the Hessian matrix (see (4.54)). The computations are straightforward but tedious and are left as an exercise for the interested reader (see Exercise 5.17 for the derivation).


The  Hausman  LM-test  on  exogeneity 

The Hausman LM-test on exogeneity can be computed as follows.


Hausman test on exogeneity

•  Step 1: Perform preliminary regressions. Regress $y$ on $X$, with $n \times 1$ residual vector $e = y - Xb$. Regress every possibly endogenous regressor $x_j$ on $Z$ in (5.80), with $n \times 1$ residual vector $\hat{v}_j = x_j - Z\hat{\gamma}_j$.

•  Step 2: Perform the auxiliary regression. Regress $e$ on $X$ (the $n \times k$ matrix including both the $(k - k_0)$ exogenous and the $k_0$ possibly endogenous regressors) and on the $k_0$ series of residuals $\hat{v}_{k-k_0+1}, \ldots, \hat{v}_k$, that is, perform OLS in the model

$$e_i = \sum_{j=1}^{k}\delta_jx_{ji} + \sum_{j=k-k_0+1}^{k}\alpha_j\hat{v}_{ji} + \eta_i. \tag{5.83}$$

•  Step 3: LM $= nR^2$ of the regression in step 2. Then LM $= nR^2$, where $R^2$ is the coefficient of determination of the regression in step 2. Under the hypothesis that all $k_0$ regressors $x_j$, $j = k - k_0 + 1, \ldots, k$, are exogenous, LM has asymptotically the $\chi^2(k_0)$ distribution.
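
The three steps above can be sketched in code. This is an illustrative implementation under stated assumptions, not the book's own program: the data are simulated with one endogenous regressor ($k = 2$, $k_0 = 1$, $m = 3$), the helper `residuals` is a hypothetical name, and the P-value uses the closed form of the $\chi^2(1)$ tail, which applies only because $k_0 = 1$ here.

```python
import math
import numpy as np

rng = np.random.default_rng(2)
n = 500

# Simulated data (illustrative assumption): x is endogenous through u.
z1, z2, u = rng.normal(size=(3, n))
x = 0.8 * z1 - 0.5 * z2 + u + rng.normal(size=n)
y = 1.0 + 0.5 * x + 0.7 * u + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])        # k = 2 regressors, k0 = 1 suspect
Z = np.column_stack([np.ones(n), z1, z2])   # m = 3 instruments


def residuals(A, b):
    """Residuals of the OLS regression of b on the columns of A."""
    return b - A @ np.linalg.lstsq(A, b, rcond=None)[0]


# Step 1: OLS residuals e and first-stage residuals v_hat, as in (5.80).
e = residuals(X, y)
v_hat = residuals(Z, x)

# Step 2: auxiliary regression (5.83) of e on X and v_hat.
W = np.column_stack([X, v_hat])
eta = residuals(W, e)
# e has mean zero because X contains a constant, so the total sum of
# squares of the dependent variable is simply e'e.
r2 = 1.0 - (eta @ eta) / (e @ e)

# Step 3: LM = n * R^2, compared with the chi-square(k0) distribution.
lm = n * r2
p_value = math.erfc(math.sqrt(max(lm, 0.0) / 2.0))   # chi-square(1) tail only

print(f"LM = {lm:.2f}, P = {p_value:.4f}")
```

For $k_0 > 1$ the same steps apply with one column of first-stage residuals per suspect regressor, and the P-value would have to come from a $\chi^2(k_0)$ routine instead of the one-degree-of-freedom shortcut used here.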

This three-step method to compute the LM-test by means of auxiliary regressions is similar to the LM-test procedure described in Section 4.3.7 (p. 238-40) for the linear model. In step 1 the model (5.82) is estimated under the null hypothesis, that is, with $\alpha = 0$. In step 2, the residuals of step 1 are regressed on all explanatory variables in the unrestricted model (5.82). Because the regressors $v_{ji} = x_{ji} - z_i'\gamma_j$ are unknown (as the parameters $\gamma_j$ are unknown), they are replaced by the residuals $\hat{v}_{ji}$ obtained by regressing the $j$th regressor on the $m$ instruments.


Comments on the LM-test

The null hypothesis of exogeneity, that is, $\alpha_j = 0$ for $j = k - k_0 + 1, \ldots, k$ in (5.82), can also be tested by the usual F-test on the joint significance of these parameters in the regression (5.83). Under the null hypothesis, this test statistic is asymptotically distributed as $F(k_0, n - k - k_0)$. This F-test and the LM-test of step 3 above are asymptotically equivalent. That is, in large enough samples they provide (nearly) the same P-value and hence both tests lead to the same conclusion (rejection or not) concerning the exogeneity of the last $k_0$ regressors.

In another version of the F-test on exogeneity, the ‘explained’ variable $e_i$ in (5.83) is replaced by the dependent variable $y_i$. As $e = y - Xb$, both regressions (with $e_i$ or with $y_i$ on the left-hand side) have the same residual sum of squares (as all $k$ regressors $x_i$ are included on the right-hand side). That is, the F-test on the joint significance of the parameters $\alpha_j$ can be equivalently performed in the regression equation (5.83) or in the same equation with $e_i$ replaced by $y_i$.
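
The claim that both versions have the same residual sum of squares is easy to check numerically. The sketch below is illustrative only: the simulated data, the seed, and the helper `rss` are assumptions. It runs the auxiliary regression once with $e$ and once with $y$ on the left-hand side and then forms the F-statistic for the significance of the first-stage residuals.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300

# Simulated data (illustrative assumption) with one suspect regressor.
z1, z2, u = rng.normal(size=(3, n))
x = 0.8 * z1 - 0.5 * z2 + u + rng.normal(size=n)
y = 1.0 + 0.5 * x + 0.7 * u + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z1, z2])
k, k0 = 2, 1


def rss(A, b):
    """Residual sum of squares of the OLS regression of b on the columns of A."""
    r = b - A @ np.linalg.lstsq(A, b, rcond=None)[0]
    return r @ r


e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]       # OLS residuals
v_hat = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]   # first-stage residuals
W = np.column_stack([X, v_hat])

rss_e = rss(W, e)   # regression (5.83) with e on the left-hand side
rss_y = rss(W, y)   # same regressors with y on the left-hand side

# Under alpha = 0 the regressors are X only, and the residual sum of squares
# equals e'e whether e or y is the dependent variable.
rss_0 = e @ e
F = ((rss_0 - rss_e) / k0) / (rss_e / (n - k - k0))

print(rss_e, rss_y, F)
```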

Summarizing, exogeneity is equivalent to the condition that $E[v_{ji}\varepsilon_i] = 0$, where $\varepsilon_i$ are the error terms in the model $y_i = x_i'\beta + \varepsilon_i$ and $v_{ji}$ are the error terms in (5.80). As error terms are not observed, they are replaced by residuals in step 1 and the correlation between the residuals $e_i$ and $\hat{v}_{ji}$ is evaluated by the regression (5.83). Endogeneity of the regressors is indicated by significant correlations, that is, by a significant $R^2$ and significant estimates of the parameters $\alpha_j$.


Sargan  test  on  validity  of  instruments 

Finally we consider the question whether the instruments are valid. That is, we test whether the instruments are exogenous in the sense that condition (5.70) is satisfied. This assumption is critical in all the foregoing results. If the instruments are not exogenous, then IV is not consistent and also the Hausman test is not correct anymore. In some cases the exogeneity of the instruments is reasonable from an economic point of view, but in other situations this may be less clear. We illustrate this later with two examples. A simple idea to test (5.70) is to replace the (unobserved) error terms $\varepsilon_i$ by reliable estimates of these error terms. As the regressors may be endogenous, we should take not the OLS residuals but the IV residuals $e_{IV} = y - Xb_{IV}$. Under the null hypothesis that the instruments are exogenous, $b_{IV}$ is consistent and $e_{IV}$ provides reliable estimates of the vector of error terms $\varepsilon$. We test (5.70) by testing whether $z_i$ is uncorrelated with $e_{IVi}$, the $i$th component of $e_{IV}$. This suggests the following test, which is called the Sargan test on the validity of instruments.


Sargan  test  on  the  validity  of  instruments 

•  Step 1: Apply IV. Estimate $y = X\beta + \varepsilon$ by IV, with $n \times 1$ residual vector $e_{IV} = y - Xb_{IV}$.

•  Step 2: Perform auxiliary regression. Regress $e_{IV}$ on $Z$ in the model

$$e_{IVi} = z_i'\gamma + \eta_i.$$

•  Step 3: LM $= nR^2$ of the regression in step 2. Compute LM $= nR^2$ of the regression in step 2. Under the null hypothesis that the instruments are exogenous, LM asymptotically has the $\chi^2(m - k)$ distribution, where $m$ is the number of instruments (the number of variables in $z_i$) and $k$ is the number of regressors (the number of variables in $x_i$).
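
The Sargan steps can be sketched as follows. The code is a minimal illustration under stated assumptions: the data are simulated with valid instruments, $m = 3$ and $k = 2$ so that there is one over-identifying restriction, and the helper `fitted` is a hypothetical name. As a cross-check it also evaluates $n\,e_{IV}'P_Ze_{IV}/(e_{IV}'e_{IV})$, which should agree with $nR^2$ when the instruments include a constant.

```python
import math
import numpy as np

rng = np.random.default_rng(4)
n = 500

# Simulated data (illustrative assumption): z1 and z2 are valid instruments.
z1, z2, u = rng.normal(size=(3, n))
x = 0.8 * z1 - 0.5 * z2 + u + rng.normal(size=n)
y = 1.0 + 0.5 * x + 0.7 * u + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z1, z2])
m, k = Z.shape[1], X.shape[1]              # m - k = 1 degree of freedom


def fitted(A, b):
    """Fitted values of the OLS regression of b on the columns of A."""
    return A @ np.linalg.lstsq(A, b, rcond=None)[0]


# Step 1: IV estimation and IV residuals.
X_hat = fitted(Z, X)
b_iv = np.linalg.lstsq(X_hat, y, rcond=None)[0]
e_iv = y - X @ b_iv

# Step 2: regress the IV residuals on the instruments.
e_fit = fitted(Z, e_iv)
resid = e_iv - e_fit
centered = e_iv - e_iv.mean()
r2 = 1.0 - (resid @ resid) / (centered @ centered)

# Step 3: LM = n * R^2, compared with the chi-square(m - k) distribution.
lm = n * r2
p_value = math.erfc(math.sqrt(max(lm, 0.0) / 2.0))   # valid for m - k = 1 only

# Cross-check: n * e'P_Z e / e'e.
lm_closed_form = n * (e_iv @ e_fit) / (e_iv @ e_iv)

print(f"LM = {lm:.3f}, closed form = {lm_closed_form:.3f}, P = {p_value:.3f}")
```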


Derivation  of  the  distribution  of  the  Sargan  test 

To derive the distribution under the null hypothesis, in particular that the number of degrees of freedom is equal to $(m - k)$, we recall that IV corresponds to GMM with moment conditions (5.70). If $m > k$, that is, in the over-identified case, we can apply the GMM test on over-identifying restrictions of Section 4.4.3 (p. 258). Using the notation of Section 4.4.3 (p. 253), the moment functions corresponding to (5.70) are

$$g_i = z_i\varepsilon_i = z_i(y_i - x_i'\beta),$$

and $G_n = \sum_{i=1}^{n}g_i = \sum_{i=1}^{n}z_i\varepsilon_i = Z'\varepsilon$ and $J_n = \sum_{i=1}^{n}g_ig_i' = \sum_{i=1}^{n}\varepsilon_i^2z_iz_i'$. Evaluated at the GMM estimator $b_{IV}$, we get $G_n = Z'e_{IV}$ and $\mathrm{plim}\left(\frac{1}{n}J_n\right) = \mathrm{plim}\left(\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i^2z_iz_i'\right) = \sigma^2Q_{zz}$. If we approximate $Q_{zz}$ by $\frac{1}{n}Z'Z$ and $\sigma^2$ by $\frac{1}{n}e_{IV}'e_{IV}$, we get $J_n \approx \frac{1}{n}e_{IV}'e_{IV}Z'Z$. Then the test on over-identifying restrictions, that is, the J-test (4.69), is given by

$$G_n'J_n^{-1}G_n = n\,\frac{e_{IV}'Z(Z'Z)^{-1}Z'e_{IV}}{e_{IV}'e_{IV}} = nR^2$$

with the $R^2$ of the regression in step 2 above. So our intuitive arguments for the Sargan test can be justified by the GMM test on over-identifying restrictions. According to Section 4.4.3 (p. 258), under the null hypothesis of exogenous instruments there holds

$$LM = nR^2 \approx \chi^2(m - k).$$

The validity of the instruments is rejected for large values of this test statistic. Note that the validity can be checked only if $m > k$, that is, if the number of instruments exceeds the number of regressors. In the exactly identified case ($m = k$) the validity of the instruments cannot be tested.



XM511IBR 


Example  5.33:  Interest  and  Bond  Rates  (continued) 

We  continue  our  previous  analysis  of  the  interest  and  bond  rate  data 
in  Example  5.30.  We  will  discuss  (i)  a  comparison  of  the  IV  and 
OLS  estimates,  (ii)  the  Hausman  test  on  exogeneity,  and  (iii)  the  Sargan  test 
on  the  validity  of  the  lagged  Treasury  Bill  rate  changes  as  instruments. 

(i)  Comparison  of  IV  and  OLS  estimates 

In Section 5.7.1 we estimated the relation between changes in the AAA bond rate ($y_i$) and the Treasury Bill rate ($x_i$) by instrumental variables, with the lagged values $x_{i-1}$ and $x_{i-2}$ as instruments. Exhibit 5.43, Panels 3 and 4 (p. 401), shows the results of OLS and of IV. Denoting the estimates of $\alpha$ and $\beta$ in $y_i = \alpha + \beta x_i + \varepsilon_i$ by $a$ and $b$, respectively, we see that $a_{IV} - a = -0.004$ and $b_{IV} - b = -0.137$. The covariance matrices of these estimates are

$$\widehat{\mathrm{var}}_{OLS} = s^2(X'X)^{-1} = 10^{-5}\begin{pmatrix}23.8 & 1.6\\ 1.6 & 56.1\end{pmatrix}, \qquad \widehat{\mathrm{var}}_{IV} = s_{IV}^2(\hat{X}'\hat{X})^{-1} = 10^{-5}\begin{pmatrix}27.5 & 12.0\\ 12.0 & 421.9\end{pmatrix}.$$

If one uses these results to test for exogeneity, it follows that

$$\begin{pmatrix}a_{IV} - a & b_{IV} - b\end{pmatrix}\left(\widehat{\mathrm{var}}_{IV} - \widehat{\mathrm{var}}_{OLS}\right)^{-1}\begin{pmatrix}a_{IV} - a\\ b_{IV} - b\end{pmatrix} = 5.11.$$

This is smaller than the 5 per cent critical value of the $\chi^2(2)$ distribution (5.99), so that at this significance level this test does not lead to rejection of the hypothesis that $x_i$ is exogenous.
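
The quadratic form above can be retraced with the rounded numbers given in the text. The sketch below is illustrative: the helper `inv2` and the variable names are assumptions, and because the inputs are rounded, the result is only expected to come out near the reported 5.11, not to match it exactly. It also checks that the difference of the two covariance matrices is positive definite, the condition under which the statistic has a sensible interpretation.

```python
import math


def inv2(mat):
    """Inverse of a 2 x 2 matrix given as nested lists."""
    (a, b), (c, d) = mat
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


# Rounded covariance matrices and differences of estimates from the text.
var_ols = [[23.8e-5, 1.6e-5], [1.6e-5, 56.1e-5]]
var_iv = [[27.5e-5, 12.0e-5], [12.0e-5, 421.9e-5]]
d = [-0.004, -0.137]   # (a_IV - a, b_IV - b)

diff = [[var_iv[i][j] - var_ols[i][j] for j in range(2)] for i in range(2)]

# A symmetric 2 x 2 matrix is positive definite if both leading minors are > 0.
det_diff = diff[0][0] * diff[1][1] - diff[0][1] * diff[1][0]
is_positive_definite = diff[0][0] > 0 and det_diff > 0

w = inv2(diff)
stat = sum(d[i] * w[i][j] * d[j] for i in range(2) for j in range(2))

# Upper tail of the chi-square(2) distribution has the closed form exp(-x/2).
p_value = math.exp(-stat / 2.0)

print(f"statistic = {stat:.2f}, P = {p_value:.3f}, PD = {is_positive_definite}")
```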

(ii)  Hausman  test  on  exogeneity 

As the above test is not so reliable, we now perform the Hausman test. As in the model $y_i = \alpha + \beta x_i + \varepsilon_i$ we have $k = 2$ and as the constant term is exogenous, it follows that $k_0 = 1$. The result of step 2 of the Hausman test is in Panel 1 of Exhibit 5.47, where ‘resaux’ stands for the residuals obtained

Panel 1: Step 2 of Hausman test; Dependent Variable: RESOLS
Method: Least Squares
Sample: 1980:01 1999:12
Included observations: 240

Variable       Coefficient   Std. Error   t-Statistic   Prob.
C              -0.003895     0.015359     -0.253610     0.8000
DUS3MT         -0.136674     0.060199     -2.270359     0.0241
RESAUX          0.161106     0.065359      2.464945     0.0144

R-squared       0.024996

Panel 2: Correlations between IV residuals and lagged values of DUS3MT

Lag      0      1      2      3      4      5      6      7      8      9      10
Corr.    0.35   -0.01  -0.01  -0.07  0.07   0.16   -0.06  0.02   0.13   -0.02  -0.01

[Scatter diagrams (c), (d), and (e): IV residuals against DUS3MT at lags 0, 1, and 2; the horizontal axis of (e) is DUS3MTLAG2.]

(f) 


Panel 6: Step 2 of Sargan test; Dependent Variable: RESIV
Method: Least Squares
Sample: 1980:01 1999:12
Included observations: 240

Variable       Coefficient   Std. Error   t-Statistic   Prob.
C              -0.000156     0.016525     -0.009431     0.9925
DUS3MT(-1)     -0.002218     0.026395     -0.084042     0.9331
DUS3MT(-2)     -0.003387     0.026378     -0.128395     0.8979

R-squared       0.000135

Exhibit  5.47  Interest  and  Bond  Rates  (Example  5.33) 

Panel  1  contains  the  regression  of  step  2  of  the  Hausman  test  on  exogeneity  of  the 
explanatory  variable  DUS3MT  (RESOLS  and  RESAUX  are  the  residuals  obtained  in  step  1 
of  the  Hausman  test,  RESOLS  are  the  residuals  of  the  regression  in  Panel  3  of  Exhibit  5.43, 
and  RESAUX  are  the  residuals  of  the  regression  in  Panel  2  of  Exhibit  5.46).  Panel  2  shows 
the  correlations  of  the  IV  residuals  with  lags  of  the  explanatory  variable  for  lags  0-10,  and  the 
three  scatter  diagrams  are  for  lags  0  (c),  1  (d),  and  2  (e).  Panel  6  contains  the  regression  for  step 
2  of  the  Sargan  test  on  validity  of  instruments  (RESIV  are  the  IV  residuals  obtained  in  step  1  of 
this  test;  this  regression  is  shown  in  Panel  1  of  Exhibit  5.46). 


by regressing $x_i$ on the instruments $x_{i-1}$, $x_{i-2}$ and a constant term. The t-test on the significance of ‘resaux’ has a P-value of 0.014, and the Hausman LM-test gives $LM = nR^2 = 240 \cdot 0.024996 = 6.00$, with P-value (corresponding to the $\chi^2(1)$ distribution) $P = 0.014$. This indicates that the assumption of exogeneity should be rejected, and that the OLS estimator may be considerably biased. The IV estimate of the slope is much smaller than the OLS estimate and it has a much larger standard error (0.065 instead of the computed value of 0.024 for OLS (see Panels 3 and 4 of Exhibit 5.43)).

(iii)  Sargan  test  on  validity  of  instruments 

The IV estimates can be trusted only if the instruments $x_{i-1}$ and $x_{i-2}$ are exogenous, that is, this requires that $E[x_{i-1}\varepsilon_i] = E[x_{i-2}\varepsilon_i] = 0$. Exhibit 5.47 shows the correlations between lagged values of $x_i$ and the IV residuals $e_{IV}$ (in Panel 2) and scatters of the IV residuals against $x_i$, $x_{i-1}$, and $x_{i-2}$ (see (c), (d), and (e)). This indicates that $x_i$ is indeed not exogenous but that $x_{i-1}$ and $x_{i-2}$ are exogenous (with correlations of around $-0.01$, both between $x_{i-1}$ and $e_{IV}$ and between $x_{i-2}$ and $e_{IV}$).

Panel 6 of Exhibit 5.47 shows the regression of step 2 of the Sargan test. This gives $LM = nR^2 = 240 \cdot 0.000135 = 0.032$. As there are $m = 3$ instruments (the constant term and $x_{i-1}$ and $x_{i-2}$) and $k = 2$ regressors (the constant term and $x_i$), it follows that the $\chi^2$-distribution has $(m - k) = 1$ degree of freedom. The P-value of the LM-test, corresponding to the $\chi^2(1)$ distribution, is $P = 0.86$. This indicates that the lagged values of $x_i$ are valid instruments.

Example  5.34:  Motor  Gasoline  Consumption  (continued) 

Next  we  consider  the  data  on  motor  gasoline  consumption  introduced 
in  Example  5.31.  We  will  discuss  (i)  the  Hausman  test  on  exogeneity  of  the 
gasoline  price,  (ii)  the  Sargan  test  on  the  validity  of  the  price  indices  as 
instruments,  and  (iii)  a  remark  on  the  required  model  assumptions. 

(i)  Hausman  test  on  the  exogeneity  of  the  gasoline  price 

In Example 5.31 we considered the relation between gasoline consumption (GC), gasoline price (PG), and disposable income (RI) in the USA. We postulated the demand equation

$$GC_i = \alpha + \beta PG_i + \gamma RI_i + \varepsilon_i.$$

We supposed that RI is exogenous and considered the possible endogeneity of PG. The outcomes of OLS and IV estimates (with five instruments, that is, a constant, RI, and the three price indices RPT, RPN, and RPU) in Panels 1 and 2 of Exhibit 5.45 turned out to be close together, suggesting that PG is exogenous. Panel 1 of Exhibit 5.48 shows the regression of step 2 of the Hausman test, with outcome $LM = nR^2 = 2.38$. Since the constant and the income are assumed to be exogenous and the price PG is the only possibly endogenous variable, we have $k_0 = 1$. So the distribution of the LM-test

Panel 1: Step 2 of Hausman test; Dependent Variable: RESOLS
Method: Least Squares
Sample: 1970 1999
Included observations: 30

Variable       Coefficient   Std. Error   t-Statistic   Prob.
C               0.027703     0.081429      0.340209     0.7364
PG             -0.016872     0.028093     -0.600566     0.5533
RI             -0.008558     0.024638     -0.347347     0.7311
RESAUX          0.104845     0.070032      1.497107     0.1464

R-squared       0.079363

Panel 2: Step 2 of Sargan test; Dependent Variable: RESIV
Method: Least Squares
Sample: 1970 1999
Included observations: 30

Variable       Coefficient   Std. Error   t-Statistic   Prob.
C              -0.209753     0.271047     -0.773862     0.4463
RPT            -0.051202     0.062169     -0.823606     0.4180
RPN             0.020409     0.114431      0.178352     0.8599
RPU            -0.070229     0.059531     -1.179698     0.2492
RI              0.060410     0.080326      0.752055     0.4590

R-squared       0.104159

Exhibit  5.48  Motor  Gasoline  Consumption  (Example  5.34) 

Panel 1 shows the regression for step 2 of the Hausman test on exogeneity of the explanatory variable PG (RESOLS and RESAUX are the residuals obtained in step 1 of the Hausman test, RESOLS are the residuals of the regression in Panel 1 of Exhibit 5.45 and RESAUX are the residuals of the regression in Panel 3 of Exhibit 5.45). Panel 2 shows the regression for step 2 of the Sargan test on validity of instruments (RESIV are the IV residuals obtained in step 1 of this test; this regression is shown in Panel 2 of Exhibit 5.45).

(under the null hypothesis of exogeneity) is $\chi^2(1)$, which gives a P-value of $P = 0.12$. This does not lead to rejection of the exogeneity of the variable PG.

(ii)  Sargan  test  on  validity  of  instruments 

Panel 2 of Exhibit 5.48 shows the regression of step 2 of the Sargan test. Here we test whether the five instruments are exogenous. In this case $k = 3$ and $m = 5$, so that $LM = nR^2 = 3.12$ should be compared with the $\chi^2(2)$ distribution. The corresponding P-value is $P = 0.21$, so that the exogeneity of the instruments is not rejected. However, note that IV estimation is not required as the regressor PG seems to be exogenous. For these data we therefore prefer OLS, as OLS is consistent and gives (somewhat) smaller standard errors (see the results in Exhibit 5.45, Panel 1 for OLS and Panel 2 for IV).
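
The LM statistics and P-values quoted in Examples 5.33 and 5.34 can be retraced from the reported $R^2$ values with a few lines of code. This is a minimal sketch: the function name `chi2_sf` is hypothetical, and it only covers one and two degrees of freedom, for which the upper tail of the $\chi^2$ distribution has a simple closed form.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of the chi-square distribution for df = 1 or 2."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))   # P(|N(0,1)| > sqrt(x))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("closed form implemented for df = 1 and df = 2 only")


# (label, n, R^2 of the step-2 regression, degrees of freedom)
cases = [
    ("Example 5.33, Hausman", 240, 0.024996, 1),   # k0 = 1
    ("Example 5.33, Sargan", 240, 0.000135, 1),    # m - k = 3 - 2
    ("Example 5.34, Hausman", 30, 0.079363, 1),    # k0 = 1
    ("Example 5.34, Sargan", 30, 0.104159, 2),     # m - k = 5 - 3
]

results = {}
for label, n, r2, df in cases:
    lm = n * r2
    results[label] = (lm, chi2_sf(lm, df))
    print(f"{label}: LM = {lm:.3f}, P = {results[label][1]:.3f}")
```

Only the first of the four tests rejects at the 5 per cent level, which matches the conclusions drawn in the two examples.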

(iii)  Remark  on  required  model  assumptions 

We conclude by mentioning that the above tests require that the standard Assumptions 2-6 of the regression model are satisfied. It is left as an exercise (see Exercise 5.31) to show that the residuals of the above demand equation for motor gasoline consumption show significant serial correlation. Therefore we should give the above test outcomes on exogeneity and validity of instruments the correct interpretation, that is, as diagnostic tests indicating possible problems with OLS. In particular, the OLS estimates are not efficient, as they neglect the serial correlation of the disturbances. Similar remarks apply to our analysis of the interest and bond rate data in Example 5.33, as in Examples 5.27 and 5.28 we concluded that these data contain some outliers.

Exercises: T: 5.16d, 5.17; S: 5.23e.


5.7.4  Summary 

The  OLS  method  becomes  inconsistent  if  the  regressors  are  not  exogen¬ 
ous.  In  this  case  the  OLS  estimates  may  provide  very  misleading  infor¬ 
mation.  One  may  proceed  as  follows. 

•  First  of  all,  try  to  use  economic  intuition  to  guess  whether  endogeneity 
might  play  a  role  for  the  investigation  at  hand.  If  it  does,  one  should  find 
a  sufficient  number  of  instruments  that  are  exogenous  and  that  carry 
information  on  the  possibly  endogenous  regressors  (that  is,  the  order 
and  rank  conditions  should  be  satisfied). 

•  Investigate  the  possible  endogeneity  of  ‘suspect’  regressors  by  means  of 
the  Hausman  test.  If  one  has  a  sufficiently  large  number  of  instruments, 
then  perform  the  Sargan  test  to  check  whether  the  proposed  instruments 
are  indeed  exogenous. 

•  If some of the regressors are endogenous and the instruments are valid,
then consistent estimates are obtained by the instrumental variables
estimation method. The t- and F-tests can be performed as usual,
although some care is needed to use the correct formulas (see (5.77)
and (5.78)).

•  If the endogeneity is only weak, then OLS may be considered as an
alternative, provided that the resulting bias is compensated by a
sufficiently large increase in efficiency as compared to IV.

5.8 Illustration: Salaries of top managers


The  discussion  in  this  chapter  could  lead  one  to  think  that  ordinary  least 
squares  is  threatened  from  so  many  sides  that  it  never  works  in  practice.  This 
is  not  true.  OLS  is  a  natural  first  step  in  estimating  economic  relations  and  in 
many  cases  it  provides  valuable  insight  in  the  nature  of  such  relations.  By 
means  of  the  following  example  we  will  illustrate  that  in  some  cases  OLS 
provides  a  reasonable  model  that  performs  well  under  various  relevant 
diagnostic  tests. 

Example  5.35:  Salaries  of  Top  Managers 

As an example we analyse the relation between salaries of top managers and
profits of firms. The data set consists of 100 large firms in the Netherlands in
1999. The 100 firms are ordered with increasing profits. Let $y_i$ be the
logarithm of the average yearly salary (in thousands of Dutch guilders) of
top managers of firm $i$ and let $x_i$ be the logarithm of the profit (in millions of
Dutch guilders) of firm $i$. Results of OLS in the model

$$y_i = \alpha + \beta x_i + \varepsilon_i$$


are in Panel 3 of Exhibit 5.49, and this exhibit also shows the outcomes of
various diagnostic tests discussed in this chapter. The tests in Exhibit 5.49
(e)-(q) do not indicate any misspecification of the model, so that we are
satisfied with this simple relation. The estimated elasticity $\hat{\beta}$ is around 0.16
(a 1 per cent higher profit goes together with a salary that is about 0.16 per
cent higher), so that salaries of top managers tend to be rather inelastic with
respect to profits when compared over this cross section of firms.

(a), (b)  [Scatter diagrams of salary against profit, in levels (a) and in logarithms (b); figures not reproduced.]

Panel 3: Dependent Variable: LOGSALARY
Method: Least Squares
Sample(adjusted): 5 100
Included observations: 96 after adjusting endpoints

Variable     Coefficient   Std. Error   t-Statistic   Prob.
C            6.350338      0.128961     49.24249      0.0000
LOGPROFIT    0.162212      0.021984     7.378765      0.0000

R-squared            0.366774     Mean dependent var      7.269493
Adjusted R-squared   0.360037     S.D. dependent var      0.408740
S.E. of regression   0.326982     Akaike info criterion   0.622791
Sum squared resid    10.05023     Schwarz criterion       0.676215
Log likelihood       -27.89396    F-statistic             54.44617
Durbin-Watson stat   2.233248     Prob(F-statistic)       0.000000

(d)  [Graph of actual and fitted logarithmic salaries and the residuals (legend: Residual, Actual, Fitted); figure not reproduced.]


Exhibit  5.49  Salaries  of  Top  Managers  (Example  5.35) 

Scatter diagrams of salary against profit (in levels (a) and in logarithms (b)), regression
table  (with  variables  in  logarithms,  Panel  3),  and  graph  of  actual  and  fitted  (logarithmic) 
salaries  and  corresponding  least  squares  residuals  ((d);  the  data  are  ordered  with 
increasing  values  of  profits;  the  original  number  of  observations  is  100,  but  the  number 
of  observations  in  estimation  is  96,  as  4  firms  have  negative  profits). 
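The steps behind a regression table such as Panel 3 can be sketched in a few lines. The data of Example 5.35 are not reproduced here, so the sketch below uses simulated firms; the sample size, the four loss-making firms, and the parameter values 6.35, 0.16, and 0.33 are assumptions chosen only to resemble the example. It shows why observations with negative profits drop out of a log-log regression and how the slope is read as an elasticity.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Hypothetical firm data (NOT the data of Example 5.35).
profit = rng.lognormal(mean=5.0, sigma=1.5, size=n)
profit[:4] *= -1.0                      # four firms with losses
salary = np.exp(6.35 + 0.33 * rng.normal(size=n)) * np.abs(profit) ** 0.16

# log(profit) is undefined for negative profits, so these firms drop out.
keep = profit > 0
y = np.log(salary[keep])
x = np.log(profit[keep])
X = np.column_stack([np.ones(x.size), x])

# OLS: b = (X'X)^{-1} X'y, with conventional standard errors.
b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
n_obs, k = X.shape
s2 = e @ e / (n_obs - k)
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
r2 = 1.0 - e @ e / np.sum((y - y.mean()) ** 2)

print(f"observations used: {n_obs}")
print(f"elasticity estimate: {b[1]:.3f} (s.e. {se[1]:.3f}), R2 = {r2:.3f}")
```

In a log-log model the slope is an elasticity: a value of 0.16 means that a 1 per cent higher profit goes together with a salary that is roughly 0.16 per cent higher.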
Panel 5: Ramsey RESET Test:

F-statistic            0.878905    Probability   0.350930
Log likelihood ratio   0.902997    Probability   0.341979

Test Equation: Dependent Variable: LOGSALARY
Method: Least Squares
Sample: 5 100; Included observations: 96

Variable     Coefficient   Std. Error   t-Statistic   Prob.
C            -5.981640     13.15475     -0.454713     0.6504
LOGPROFIT    -0.582640     0.794813     -0.733053     0.4654
FITTED^2     0.312867      0.333725     0.937499      0.3509


Panel 9: Chow Breakpoint Test: 77

F-statistic            0.845556    Probability   0.432627
Log likelihood ratio   1.748622    Probability   0.417149

Panel 10: Chow Forecast Test: Forecast from 77 to 100

F-statistic            0.873523    Probability   0.634024
Log likelihood ratio   25.14958    Probability   0.397669

Test Equation: Dependent Variable: LOGSALARY
Method: Least Squares
Sample: 5 76; Included observations: 72

Variable     Coefficient   Std. Error   t-Statistic   Prob.
C            6.521363      0.222507     29.30862      0.0000
LOGPROFIT    0.124737      0.044001     2.834870      0.0060

R-squared            0.102984     Mean dependent var      7.142288
Adjusted R-squared   0.090169     S.D. dependent var      0.348475
S.E. of regression   0.332393     Akaike info criterion   0.662387
Sum squared resid    7.733956     Schwarz criterion       0.725628
Log likelihood       -21.84593    F-statistic             8.036490
Durbin-Watson stat   2.306228     Prob(F-statistic)       0.005988

Exhibit 5.49 (Contd.)

Diagnostic  tests,  RESET  (Panel  5),  recursive  residuals  with  CUSUM  and  CUSUMSQ  tests 
((f)-(h)),  Chow  break  test  (Panel  9),  and  Chow  forecast  test  (Panel  10),  both  with  72  firms 
(those  with  lower  profits)  in  the  first  subsample  and  with  24  firms  (those  with  higher  profits) 
in  the  second  subsample.  The  test  outcomes  do  not  give  reason  to  adjust  the  functional 
specification  of  the  model  (Assumptions  2,  5,  and  6). 
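The F-statistics in Panels 5, 9, and 10 are all built from sums of squared residuals (SSR). The sketch below uses the standard textbook forms of the RESET, Chow break, and Chow forecast statistics; the function names are hypothetical and the simulated data are an assumption, not the data of the example. The two numerical checks, however, use only numbers reported in the panels above: the Chow forecast F follows from the SSR of Panel 3 (10.05023) and of the 72-firm subsample (7.733956), and the RESET F with one added regressor equals the squared t-value of FITTED^2.

```python
import numpy as np

def ssr(X, y):
    """Sum of squared residuals of the OLS regression of y on X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    return float(e @ e)

def reset_f(X, y):
    """RESET with one extra regressor: the squared fitted values."""
    n, k = X.shape
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    X_aug = np.column_stack([X, (X @ b) ** 2])
    s_r, s_u = ssr(X, y), ssr(X_aug, y)
    return (s_r - s_u) / (s_u / (n - k - 1))

def chow_break_f(X, y, n1):
    """Chow break test: separate parameters for the first n1 and the remaining observations."""
    n, k = X.shape
    s0 = ssr(X, y)
    s12 = ssr(X[:n1], y[:n1]) + ssr(X[n1:], y[n1:])
    return ((s0 - s12) / k) / (s12 / (n - 2 * k))

def chow_forecast_f(s0, s1, n1, n2, k):
    """Chow forecast test from the full-sample SSR (s0) and the first-subsample SSR (s1)."""
    return ((s0 - s1) / n2) / (s1 / (n1 - k))

# Checks against the reported output (numbers taken from Panels 3, 5 and 10).
f_forecast = chow_forecast_f(s0=10.05023, s1=7.733956, n1=72, n2=24, k=2)
f_reset_from_t = 0.937499 ** 2
print(round(f_forecast, 4))       # 0.8735, compare Panel 10
print(round(f_reset_from_t, 4))   # 0.8789, compare Panel 5

# Simulated data (an assumption), ordered by the regressor as in the example.
rng = np.random.default_rng(1)
x = np.sort(rng.normal(5.0, 1.5, size=96))
y = 6.35 + 0.16 * x + rng.normal(scale=0.33, size=96)
X = np.column_stack([np.ones(96), x])
f_reset = reset_f(X, y)
f_break = chow_break_f(X, y, n1=72)
print(f_reset, f_break)
```

Small values of these statistics, relative to the relevant F-distribution, give no reason to change the functional specification.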
Panel 11: White Heteroskedasticity Test:

F-statistic      0.589335    Probability   0.556754
Obs*R-squared    1.201465    Probability   0.548410

Test Equation: Dependent Variable: RESOLS^2
Method: Least Squares
Sample: 5 100; Included observations: 96

Variable       Coefficient   Std. Error   t-Statistic   Prob.
C              -0.032151     0.140858     -0.228249     0.8200
LOGPROFIT      0.048798      0.046586     1.047494      0.2976
LOGPROFIT^2    -0.004059     0.003746     -1.083502     0.2814

R-squared      0.012515

Panel 12: Breusch-Godfrey Serial Correlation LM Test:

F-statistic      0.759000    Probability   0.471043
Obs*R-squared    1.558288    Probability   0.458798

Test Equation: Dependent Variable: RESOLS
Method: Least Squares

Variable       Coefficient   Std. Error   t-Statistic   Prob.
C              -0.005453     0.129438     -0.042130     0.9665
LOGPROFIT      0.001021      0.022068     0.046281      0.9632
RESOLS(-1)     -0.119031     0.104460     -1.139490     0.2575
RESOLS(-2)     0.035061      0.105207     0.333255      0.7397

R-squared      0.016232

Panel 13: CORRELATIONS OF RESOLS

Lag    AC        Ljung-Box   Prob
1      -0.122    1.4799      0.224
2      0.048     1.7100      0.425
3      -0.095    2.6321      0.452
4      0.262     9.6293      0.047
5      0.019     9.6656      0.085
6      -0.060    10.045      0.123
7      0.006     10.049      0.186
8      -0.111    11.368      0.182
9      0.172     14.571      0.103
10     0.106     15.792      0.106

Exhibit  5.49  (Contd.) 

Diagnostic  tests,  White  test  on  heteroskedasticity  (Panel  11),  tests  on  serial  correlation 
(Breusch-Godfrey  LM-test  in  Panel  12  and  Ljung-Box  test  in  Panel  13),  and  test  on  normality 
(histogram and Jarque-Bera test (n)). RESOLS denotes the OLS residuals of the regression in
Panel  3.  The  test  outcomes  do  not  give  reason  to  adjust  the  standard  probability  model  for  the 
disturbance  terms  (Assumptions  3,  4,  and  7). 
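The statistics in Panels 11-13 come from auxiliary regressions of the OLS residuals. The sketch below is a minimal NumPy version under stated assumptions: the function names are hypothetical, the data are simulated, and pre-sample lagged residuals in the Breusch-Godfrey regression are set to zero (which is consistent with the 96 observations implied by Panel 12). The first two lines check the reported `Obs*R-squared` values as n times the reported R-squared.

```python
import numpy as np

def ols(X, y):
    """Return the OLS residuals and the R-squared of the regression of y on X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    r2 = 1.0 - (e @ e) / np.sum((y - y.mean()) ** 2)
    return e, r2

def white_lm(x, e):
    """White test (single regressor): n*R2 of e^2 on a constant, x and x^2."""
    Z = np.column_stack([np.ones(x.size), x, x ** 2])
    return x.size * ols(Z, e ** 2)[1]

def breusch_godfrey_lm(X, e, p):
    """Breusch-Godfrey test: n*R2 of e on X and p lags of e (pre-sample lags set to 0)."""
    n = e.size
    lags = np.column_stack(
        [np.concatenate([np.zeros(j), e[:n - j]]) for j in range(1, p + 1)]
    )
    return n * ols(np.column_stack([X, lags]), e)[1]

def ljung_box(e, p):
    """Ljung-Box statistic based on the first p residual autocorrelations."""
    n = e.size
    d = e - e.mean()
    r = np.array([d[k:] @ d[:-k] / (d @ d) for k in range(1, p + 1)])
    return n * (n + 2) * np.sum(r ** 2 / (n - np.arange(1, p + 1)))

def jarque_bera(e):
    """Jarque-Bera statistic from the sample skewness and kurtosis."""
    d = e - e.mean()
    m2 = np.mean(d ** 2)
    skew = np.mean(d ** 3) / m2 ** 1.5
    kurt = np.mean(d ** 4) / m2 ** 2
    return e.size * (skew ** 2 / 6.0 + (kurt - 3.0) ** 2 / 24.0)

# Checks against the reported output: Obs*R-squared = n * R-squared.
print(round(96 * 0.012515, 3))   # 1.201, compare Panel 11
print(round(96 * 0.016232, 3))   # 1.558, compare Panel 12

# Simulated data (an assumption) with well-behaved disturbances.
rng = np.random.default_rng(3)
x = rng.normal(5.0, 1.5, size=96)
y = 6.35 + 0.16 * x + rng.normal(scale=0.33, size=96)
X = np.column_stack([np.ones(96), x])
resols, _ = ols(X, y)
lm_white = white_lm(x, resols)
lm_bg = breusch_godfrey_lm(X, resols, p=2)
q_lb = ljung_box(resols, p=10)
jb = jarque_bera(resols)
print(lm_white, lm_bg, q_lb, jb)
```

Under the respective null hypotheses these statistics are asymptotically chi-squared distributed, with 2 degrees of freedom for the White, Breusch-Godfrey (two lags), and Jarque-Bera tests in this setting.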
(o)  [Scatter diagram of LOGPROFIT against the instrument LOGTURNOVER; figure not reproduced.]

Panel 16: Dependent Variable: LOGSALARY
Method: Instrumental Variables
Included observations: 84
Excluded observations: 16 (missing values of turnover)
Instrument list: C LOGTURNOVER

Variable     Coefficient   Std. Error   t-Statistic   Prob.
C            6.253435      0.182995     34.17263      0.0000
LOGPROFIT    0.181561      0.031937     5.685017      0.0000

R-squared    0.385981

Panel 17: Dependent Variable: RESOLS
Method: Least Squares
Included observations: 84
Excluded observations: 16 (missing values of turnover)

Variable     Coefficient   Std. Error   t-Statistic   Prob.
C            -0.096903     0.184116     -0.526314     0.6001
LOGPROFIT    0.019349      0.032132     0.602173      0.5487
V            -0.002833     0.052095     -0.054385     0.9568

R-squared    0.006438

Exhibit 5.49 (Contd.)


Diagnostic  tests,  instrumental  variable  estimate  of  the  wage  equation  (with  LOGTURNOVER 
as  instrument,  Panel  16;  the  scatter  diagram  of  the  explanatory  variable  against  the  instrument 
is  shown  in  (o))  and  step  2  of  the  Hausman  test  on  exogeneity  of  the  explanatory  variable 
(LOGPROFIT)  in  the  wage  equation  (Panel  17;  RESOLS  denotes  the  OLS  residuals  of  the 
regression  in  Panel  3  and  V  denotes  the  residuals  of  the  regression  of  LOGPROFIT  on  a 
constant  and  LOGTURNOVER).  The  sample  size  in  estimation  is  84  because  the  turnover 
of  some  of  the  firms  is  unknown.  The  test  outcomes  do  not  give  reason  to  reject  the  assumption 
of  exogeneity  of  profits  in  the  wage  equation  for  top  managers  (Assumption  1). 
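The two-step computation described in this caption can be sketched as follows. The data are simulated (an assumption, as the turnover data are not reproduced here), the function names are hypothetical, and the design has one regressor and one instrument, as in Panels 16 and 17. Two cases are generated: one where the regressor is exogenous and one where its disturbance is correlated with the part of the regressor that the instrument does not explain.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients, residuals and conventional standard errors."""
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    e = y - X @ b
    s2 = e @ e / (X.shape[0] - X.shape[1])
    return b, e, np.sqrt(s2 * np.diag(XtX_inv))

def hausman_t(y, x, z):
    """t-value of V in the regression of the OLS residuals on (1, x, V)."""
    const = np.ones(x.size)
    X = np.column_stack([const, x])
    Z = np.column_stack([const, z])
    _, e, _ = ols(X, y)                # OLS residuals of the equation of interest
    _, v, _ = ols(Z, x)                # step 1: residuals V of x on the instruments
    b2, _, se2 = ols(np.column_stack([X, v]), e)   # step 2: e on (1, x, V)
    return b2[2] / se2[2]

def iv(y, x, z):
    """Exactly identified IV estimator (Z'X)^{-1} Z'y."""
    const = np.ones(x.size)
    X = np.column_stack([const, x])
    Z = np.column_stack([const, z])
    return np.linalg.solve(Z.T @ X, Z.T @ y)

rng = np.random.default_rng(2)
n = 500
z = rng.normal(size=n)                 # instrument
v = rng.normal(size=n)                 # part of x not explained by z
w = rng.normal(size=n)
x = 1.0 + z + v

y_exog = 1.0 + 0.5 * x + w             # disturbance unrelated to x
y_endog = 1.0 + 0.5 * x + 0.8 * v + w  # disturbance correlated with x through v

t_exog = hausman_t(y_exog, x, z)
t_endog = hausman_t(y_endog, x, z)
b_ols = ols(np.column_stack([np.ones(n), x]), y_endog)[0]
b_iv = iv(y_endog, x, z)
print(f"t-value of V: exogenous case {t_exog:.2f}, endogenous case {t_endog:.2f}")
print(f"slope in endogenous case: OLS {b_ols[1]:.3f}, IV {b_iv[1]:.3f} (true value 0.5)")
```

In the endogenous case the t-value of V is large and the OLS slope is pulled away from 0.5, whereas the IV slope stays close to it. In Panel 17 the t-value of V is only -0.05, which is why exogeneity of LOGPROFIT is not rejected.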


Exercises: E: 5.32.

Summary, further reading, and keywords


SUMMARY

In this chapter the seven standard assumptions of the regression model were
subjected to diagnostic tests. The exogeneity of the regressors (Assumption 1)
is required for OLS to be consistent, and this was investigated in Section 5.7.
If the regressors are endogenous, then consistent estimates can be obtained
by using instrumental variables. The functional specification of the model
(linear model with constant parameters, Assumptions 2, 5, and 6) was
discussed in Sections 5.2 and 5.3. A correct specification is required to get
consistent estimators. In practice it may be worthwhile excluding the less
relevant variables, namely if the resulting bias is compensated by an
increased efficiency of the estimators. We also discussed methods for the
specification and estimation of non-linear models and models with varying
parameters. If the disturbances of the model are heteroskedastic or serially
correlated (so that Assumptions 3 or 4 are not satisfied), then OLS is
consistent but not efficient. The efficiency can be increased by using weighted least
squares (based on a model for the variances of the disturbances) or by
transforming the model (to remove the serial correlation of the disturbances).
This was discussed in Sections 5.4 and 5.5. In Section 5.6 we considered the
assumption of normally distributed disturbances (Assumption 7). If the
disturbances are not normally distributed, then OLS is consistent but not
efficient. Regression diagnostics can be used to detect influential
observations, and if there are relatively many outliers then robust methods can
improve the efficiency of the estimators.


FURTHER  READING 

The textbooks mentioned in Chapter 3, Further Reading (p. 178-9), contain
chapters on most of the topics discussed in this chapter. For a more extensive
treatment of some of these topics we refer to the three volumes of the Handbook
of Econometrics mentioned in Chapter 3. We mention some further references:
Belsley, Kuh, and Welsch (1980) for regression diagnostics; Cleveland (1993) and
Fan and Gijbels (1996) for non-parametric methods; Godfrey (1988) for
diagnostic tests; Rousseeuw and Leroy (1987) for robust methods.

Exercises


THEORY  QUESTIONS 

5.1 (Section 5.2.1)

Consider the model $y = X_1\beta_1 + X_2\beta_2 + \varepsilon$ with
$\beta_2 \neq 0$. It was shown in Section 3.2.3 (p. 143)
that the restricted least squares estimator
$b_R = (X_1'X_1)^{-1}X_1'y$ has a variance that is smaller
than that of the unrestricted least squares estimator
$b_1$ in the model that includes both $X_1$ and $X_2$.

a. Show that the standard error of the regression $s$
may be larger in the restricted model. Is the
standard error in the restricted model always
larger?

b. Show that, as a consequence, the estimates $b_R$
need not be more significant (in the sense of
having larger t-values) than the estimates $b_1$.

c. Verify the results in a and b by simulating a data
set with sample size $n = 100$ by means of the
model $y_i = \beta_1 + \beta_2 x_{2i} + \varepsilon_i$, where $x_{1i} = 1$ is the
constant term and $x_{2i}$ and $\varepsilon_i$ are two independent
samples from the standard normal distribution.
As parameter values take $\beta_1 = 1$ and $\beta_2 = 10$.

d. Discuss the relevance of your findings for the
'bottom-up' strategy in model selection, which
starts with small models and performs sequential
tests on the significance of additional variables.

5.2 (Sections 5.2.1, 5.2.4)

a. Using the notation of Section 5.2.1, show that
$\mathrm{MSE}(b_1) - \mathrm{MSE}(b_R) = P(V_2 - \beta_2\beta_2')P'$, where
$P = (X_1'X_1)^{-1}X_1'X_2$ and where $V_2 = \mathrm{var}(b_2)$ is
the $g \times g$ covariance matrix of $b_2$ in the model
$y = X_1\beta_1 + X_2\beta_2 + \varepsilon$.

b. Using again the notation of Section 5.2.1, prove
that $\mathrm{TMSP}(b_R) \le \mathrm{TMSP}(b_1)$ if and only if
$\beta_2' V_2^{-1} \beta_2 \le g$.

c. Prove that, for $n$ sufficiently large, AIC
corresponds to an F-test with critical value
approximately equal to 2.

d. Prove that SIC corresponds to an F-test with
critical value approximately equal to $\log(n)$.

e. Suppose that $\log(y) \sim N(\mu, \sigma^2)$; then prove that $y$
has mean $e^{\mu + \sigma^2/2}$, median $e^{\mu}$, and variance
$e^{2\mu + \sigma^2}(e^{\sigma^2} - 1)$.

f. Consider the non-linear wage model $S(\lambda) = \alpha +
\gamma D_g + \mu D_m + \beta x + \varepsilon$ in Example 5.5, where
$S(\lambda) = (S^{\lambda} - 1)/\lambda$. Prove that in this model
$(dS/dx)/S = \beta/(1 + \lambda(\alpha + \gamma D_g + \mu D_m + \beta x + \varepsilon))$.

5.3 (Section 5.3.2)

a. Prove the expressions (5.12)-(5.14). It is helpful
to write out the normal equations $X_{t+1}'X_{t+1}b_{t+1}
= X_{t+1}'Y_{t+1}$, where $Y_{t+1} = (y_1, \ldots, y_{t+1})'$ and
$X_{t+1} = (x_1, \ldots, x_{t+1})' = (X_t', x_{t+1})'$.

b. Prove that the variances of the forecast errors $f_t$
in (5.11) are equal to $\sigma^2 v_t$.

c. Prove that the forecast errors $f_t$ are independent
under the standard Assumptions 1-7.

5.4 (Section 5.3.3)

a. Prove the result (5.20) for the hypothesis (5.19).

b. The F-test requires that the disturbance vectors
$\varepsilon_1$ and $\varepsilon_2$ in (5.18) are uncorrelated with mean
zero and covariance matrices $\sigma_1^2 I_{n_1}$ and $\sigma_2^2 I_{n_2}$,
where $\sigma_1^2 = \sigma_2^2$. Derive a test for the hypothesis
(5.19) for the case that $\sigma_1^2 \neq \sigma_2^2$.

c. Prove that the F-test for the hypothesis (5.22) in
the model (5.21) is equal to the forecast test in
Section 3.4.3.

5.5 (Section 5.4.3)

Consider the model $y_i = \beta x_i + \varepsilon_i$ (without constant
term and with $k = 1$), where $E[\varepsilon_i] = 0$, $E[\varepsilon_i\varepsilon_j] = 0$
for $i \neq j$, and $E[\varepsilon_i^2] = \sigma_i^2$.

a. Consider the following three estimators of $\beta$:
$b_1 = \sum x_i y_i / \sum x_i^2$, $b_2 = \sum y_i / \sum x_i$, and $b_3 =
\frac{1}{n}\sum (y_i/x_i)$. For each estimator, derive a model
for the variances $\sigma_i^2$ for which this estimator is
the best linear unbiased estimator of $\beta$.

b. Let $\sigma_i^2 = \sigma^2 x_i^2$. Show that the OLS estimator is
unbiased. Derive expressions for the variance
of the OLS estimator and also for the WLS
estimator.

c. Use the results in b to show that the OLS
estimator has a variance that is at least as large as
that of the WLS estimator.

d*. For the general case, show that the OLS
variance in (5.24) is always at least as large as the
WLS variance (5.29) in the sense that
$\mathrm{var}(b) - \mathrm{var}(b_*)$ is positive semidefinite.

5.6 (Sections 5.4.4, 5.4.5)

a. In the additive heteroskedasticity model with
$\sigma_i^2 = z_i'\gamma$, the variances are estimated by the
regression $e_i^2 = z_i'\gamma + \eta_i$ with error terms
$\eta_i = e_i^2 - \sigma_i^2$ (see Section 5.4.4). Now assume
that $\mathrm{plim}(\frac{1}{n}\sum z_i z_i') = Q_{zz}$ exists with $Q_{zz}$ an
invertible matrix, and assume that the
variables $z_i$ are exogenous in the sense that
$\mathrm{plim}(\frac{1}{n}\sum z_i(\varepsilon_i^2 - \sigma_i^2)) = 0$. Show that $\gamma$ is
estimated consistently under this assumption.

b*. Prove the results that are stated in Section
5.4.4 for the consistent estimation of the
parameters of the multiplicative model for
heteroskedasticity.

c. Derive the expression (5.39) for the LR-test for
groupwise heteroskedasticity in the model
$y = X\beta + \varepsilon$.

5.7* (Section 5.4.5)

In this exercise we derive the three-step method for
the computation of the Breusch-Pagan test on
homoskedasticity. Consider the model $y = X\beta + \varepsilon$,
which satisfies the standard regression Assumptions
1-7, except for Assumption 3, which is replaced by
the model (5.26).

a. Let $\sigma_i^2 = \sigma^2 z_i^{\alpha}$, where $z_i$ is a single explanatory
variable that takes on only positive values.
Derive the log-likelihood for this model, with
parameter vector $\theta = (\beta', \alpha, \sigma^2)'$, and determine
the first order conditions (for ML) and the
information matrix.

b. Derive the LM-test for homoskedasticity
($\alpha = 0$), using the results in a. Show in particular
that this can be written as $LM = SSE/2$, where
SSE is the explained sum of squares of the
regression of $e_i^2/s_{ML}^2$ on a constant and $\log(z_i)$ and
where $s_{ML}^2 = e'e/n$.

c. Show that, in large enough samples, the result in
b can also be written as $LM = nR^2$ of the
regression of $e_i^2$ on a constant and $\log(z_i)$.

d. Now consider the general model (5.26). Derive
the log-likelihood and its first order derivatives.
Show that the LM-test for $\gamma_2 = \cdots = \gamma_p = 0$ is
given by $LM = SSE/2$, where SSE is the
explained sum of squares of the regression of
$e_i^2/s_{ML}^2$ on $z_i$.

e. Show that the result in d can be written as
$LM = nR^2$ of the auxiliary regression (5.40).

5.8* (Section 5.5.3)

In this exercise we consider an alternative derivation
of the Breusch-Godfrey test, that is, the auxiliary
regression (5.49). In the text this test was derived by
using the results of Section 4.2.4 (p. 218) on
non-linear regression models, and now we will consider
the ML-based version of this test. The model is given
by (5.45) with AR(1) errors (5.47) where
$\eta_i \sim \mathrm{NID}(0, \sigma^2)$. The parameter vector is
$\theta = (\beta', \gamma, \sigma^2)'$ and the null hypothesis of no serial
correlation corresponds to $\gamma = 0$.

a. Determine the log-likelihood of this model (for
the observations $(y_2, \ldots, y_n)$, treating $y_1$ as a
fixed, non-random value).

b. Determine the first and second order derivatives
of the log-likelihood with respect to the
parameter vector $\theta$.

c. Use the results in a and b to compute the LM-test
by means of the definition in (4.54) in Section
4.3.6 (p. 238).

d. Prove that the result in c can be written as $nR^2$ of
the auxiliary regression (5.49). It may be
assumed that the model (5.45) contains a constant
term, that is, $x_{1,i} = 1$ for $i = 1, \ldots, n$, and
that $\mathrm{plim}(\frac{1}{n}\sum e_{i-1}x_i) = 0$, where $e$ is the vector
of OLS residuals.

5.9* (Section 5.5.3)

In this exercise we show that the Box-Pierce (BP) test
(5.50) is asymptotically equivalent to the Breusch-Godfrey
(BG) test obtained by the regression of the
OLS residuals $e_i$ on $x_i$ and the lagged values
$e_{i-1}, \ldots, e_{i-p}$. We assume that the explanatory
variables $x$ include a constant term and that they satisfy
the conditions that $\mathrm{plim}(\frac{1}{n}\sum_{i=1}^{n} x_i x_i') = Q$ is
invertible and that $\mathrm{plim}(\frac{1}{n}\sum_{i=j+1}^{n} e_{i-j}x_i) = 0$ for all
$j = 1, \ldots, p$. Further we assume that under the null
hypothesis of absence of serial correlation there also
holds $\mathrm{plim}(\frac{1}{n}\sum_{i=j+1}^{n} e_i e_{i-j}) = 0$ for all $j = 1, \ldots, p$,
and $\mathrm{plim}(\frac{1}{n}\sum e_i^2) = \sigma^2$.

a. Write the regression of $e_i$ on $x_i$ and $e_{i-1}, \ldots,
e_{i-p}$ as $e = X\delta + E\gamma + \omega$, where the columns of
$E$ consist of lagged values of the OLS residuals.
Let $\hat{\delta}$ and $\hat{\gamma}$ be the OLS estimators obtained from
this model; then show that (under the null
hypothesis of no serial correlation) $\sqrt{n}\,\hat{\delta} \approx 0$
and $\sqrt{n}\,\hat{\gamma} \approx \frac{1}{\sigma^2\sqrt{n}}E'e$ (where we write $a \approx b$ if
$\mathrm{plim}(a - b) = 0$).

b. Show that the explained sum of squares SSE
of the regression in a satisfies
$SSE \approx \frac{1}{n}\sum_{k=1}^{p}\left[\sum_i e_i e_{i-k}\right]^2 / \sigma^2$.

c. Use the result in b to prove that for the regression
in a there holds $nR^2 \approx n\sum_{k=1}^{p} r_k^2$.

5.10 (Section 5.5.4)

Consider the model $y_i = \mu + \varepsilon_i$, where $\mu$ is the
unknown mean of the variable $y$ and the $\varepsilon_i$ are error
terms. It is assumed that $\varepsilon_1 = (1 - \gamma^2)^{-1/2}\eta_1$ and
$\varepsilon_i = \gamma\varepsilon_{i-1} + \eta_i$ for $i = 2, \ldots, n$, where the terms $\eta_i$
(with mean zero) are uncorrelated and homoskedastic
and where $-1 < \gamma < 1$.

a. Show that the error terms $\varepsilon_i$ are homoskedastic
but that all autocorrelations are non-zero.
Describe in detail how $\mu$ can be estimated by the
Cochrane-Orcutt method.

b. Investigate whether the estimator of a is
unbiased. Investigate also whether it is consistent.

c. Now suppose that the error terms are not
generated by the above process, but that instead $\varepsilon_1 = \eta_1$
and $\varepsilon_i = \varepsilon_{i-1} + \eta_i$ for $i = 2, \ldots, n$. Derive the best
linear unbiased estimator for $\mu$ in this model.

d. Investigate whether the estimator of c is
unbiased. Investigate also whether it is consistent.

e. Try to give an intuitive explanation of the result
in d.

5.11 (Section 5.5.3)

Let $y_i = \beta y_{i-1} + \varepsilon_i$ and $\varepsilon_i = \gamma\varepsilon_{i-1} + \eta_i$, where
$-1 < \beta < 1$ and $-1 < \gamma < 1$ and the terms $\eta_i$ are
homoskedastic and uncorrelated. By $b$ we denote
the OLS estimator of $\beta$ and by $r$ the estimator of $\gamma$
obtained by regressing the OLS residuals $e_i$ on their
lagged values $e_{i-1}$.

a. Show that the above model can be rewritten as
$y_i = (\beta + \gamma)y_{i-1} - \beta\gamma y_{i-2} + \eta_i$ and that the two
transformed parameters ($\beta + \gamma$ and $\beta\gamma$) can be
estimated consistently by OLS. Can this result
be used to estimate $\beta$ and $\gamma$?

b. Prove that $\mathrm{plim}(b) = \beta + \frac{\gamma(1 - \beta^2)}{1 + \beta\gamma}$.

c. Prove that $\mathrm{plim}(b) + \mathrm{plim}(r) = \beta + \gamma$.

d. What is the implication of these results for the
Durbin-Watson test when lagged values of the
dependent variable are used as explanatory
variables in a regression model?


5.12 (Sections 5.6.2, 5.6.4)

In this exercise we use the notation of Sections 5.6.2
and 5.6.4.

a. Prove that the leverages $h_j$ in (5.54) satisfy
$0 \le h_j \le 1$ and $\sum_{j=1}^{n} h_j = k$.

b. Let $b(j)$ and $s^2(j)$ be the estimators of $\beta$ and
the disturbance variance $\sigma^2$ obtained by the
regression $y_i = x_i'\beta + \varepsilon_i$, $i \neq j$, that is, by leaving
out the $j$th observation. Further let $\hat{\beta}$ and $\hat{\gamma}$ be the
OLS estimators of $\beta$ and $\gamma$ in the model (5.55),
that is, OLS for all observations but with a
dummy for the $j$th observation included in the
model, and let $s_j^2$ be the corresponding
estimated disturbance variance. Prove that $\hat{\beta} = b(j)$,
$\hat{\gamma} = y_j - x_j'b(j)$, and $s_j^2 = s^2(j)$.

c. The studentized residual in (5.56) can be
interpreted as a Chow forecast test for the $j$th
observation, where the forecast is based on the $(n - 1)$
observations $i \neq j$. Prove this, using the results of
b and of Exercise 3.11.

d. Show that (5.64) is a consistent estimator of $\sigma$ in
the simple case that $y_i = \mu + \varepsilon_i$, where the $\varepsilon_i$ are
$\mathrm{NID}(0, \sigma^2)$.


5.13 (Section 5.6.2)

Consider the simple regression model $y_i =
\alpha + \beta x_i + \varepsilon_i$. The following results are helpful in
computing regression diagnostics for this model by
means of a single regression including all $n$
observations. We use the notation of Section 5.6.2.

a. Show that the leverages are equal to
$h_j = \frac{1}{n} + \frac{(x_j - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}$.

b. Show that $s_j^2 = \frac{n-k}{n-k-1}s^2 - \frac{e_j^2}{(n-k-1)(1-h_j)}$.

c. Show that the 'dfbetas' (for $\beta$) are equal to
$\frac{e_j^*}{\sqrt{1-h_j}} \cdot \frac{x_j - \bar{x}}{\sqrt{\sum_i (x_i - \bar{x})^2}}$,
where $e_j^*$ is the studentized residual in (5.56).

d. Give an interpretation of the results in a and c by
drawing scatter plots.

e. Show that the variance of the 'dfbetas' for the
slope parameter $\beta$ is approximately
$\frac{(x_j - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}$,
with average value $1/n$.

f. Now consider the multiple regression model.
Show that the average variance of the 'dfbetas'
in (5.58) in this case is approximately equal to
$1/n$ if the terms $a_{ll}$ are approximately constant,
that is, if all least squares coefficients have
approximately the same standard error.


5.14 (Section 5.6.4)

Consider estimation of the mean $\mu$ from a random
sample $y_i = \mu + \varepsilon_i$, $i = 1, \ldots, n$.

a. Let $\mu$ be estimated by minimizing the criterion
$\sum |y_i - \mu|$. Show that the median $m =
\mathrm{med}(y_i, i = 1, \ldots, n)$ is optimal for this criterion
(distinguish the cases $n$ odd and $n$ even).

b. Show that the median is the maximum likelihood
estimator if the disturbances are independently
distributed with the double-exponential
distribution with density function $f(\varepsilon_i) = \frac{1}{2\alpha}e^{-|\varepsilon_i|/\alpha}$.
Compare this distribution with the normal
distribution, and give an intuitive motivation
why the median will be a better estimator than
the sample mean in this case.

c. Now suppose that the disturbances are
independently $t(d)$ distributed with density
$f(\varepsilon_i) = c_d\left(1 + \frac{\varepsilon_i^2}{d}\right)^{-\frac{d+1}{2}}$, where $d$ is positive and $c_d$
is a scaling constant. Show that the ML estimator
is relatively insensitive to outliers (for small
values of $d$) by writing out the first order
condition $d\log(L)/d\mu = 0$.

d. Show this also by computing the weights $w_i$ in
(5.61) for the t-distribution.

5.15 (Section 5.6.4)

As estimation criterion we consider expressions of
the form $S(\theta) = \sum_{i=1}^{n} G(e_i(\theta))$, where $e_i(\theta)$ are
the residuals corresponding to $\theta$. We impose the
following conditions on the function $G$: $G(0) =
0$, $G(e) = G(-e)$, $G$ is non-decreasing in $|e|$ and
constant for $|e| > c$ for a given constant $c > 0$, and
the Hessian of $G$ is continuous.

a. Discuss possible motivations for each of these
five conditions.

b. Let $G$ be a non-zero polynomial of degree $m$,
say, $G(e) = \sum_{k=0}^{m} G_k e^k$ for $|e| \le c$; then prove that
$m \ge 6$ is required to satisfy the five conditions.

c. Show that, up to an arbitrary multiplicative
scaling constant, the only polynomial of degree
six, $G(e) = \sum_{k=0}^{6} G_k e^k$, that satisfies the five
conditions has a derivative $g(e) = dG(e)/de =
e\left(1 - \frac{e^2}{c^2}\right)^2$ for $|e| \le c$, the so-called bisquare
function.

d. Make plots of the functions $G(e)$ and $g(e)$ and of
the weights $w_i$ in (5.61) corresponding to this
criterion, similar to the plots shown in Exhibit
5.42. Discuss the outcomes.


5.16 (Sections 5.7.1-5.7.3)

In this exercise we consider the two-step method
(5.73) and (5.74) for the computation of the IV
estimator. We assume that $Z$ contains a constant
term, and we denote the $j$th column of $X$ by $X_j$
and the $j$th column of $\hat{X}$ by $\hat{X}_j$. So $X_j$ and $\hat{X}_j$ are
$n \times 1$ vectors, with elements denoted by $x_{ji}$ and
$\hat{x}_{ji}$, $j = 1, \ldots, k$, $i = 1, \ldots, n$.

a. Prove that of all linear combinations of the
instruments (that is, of all vectors of the form
$v = Zc$ for some $m \times 1$ vector $c$), $\hat{X}_j$ is the linear
combination that has the largest 'correlation'
with  Xj,  in  the  sense  that  it  maximizes 

b. If $X_j$ is exogenous, then it is included in the set of
instruments. Prove that in this case $\hat{X}_j$ in (5.73) is
equal to $X_j$.

c. Show that the usual OLS estimator of the
variance based on the regression in (5.74), that is,
$s^2 = (y - \hat{X}b_{IV})'(y - \hat{X}b_{IV})/(n - k)$, is not a
consistent estimator of $\sigma^2$.

d. If all the regressors in $X$ are exogenous, then
prove that $(\mathrm{var}(b_{IV}) - \mathrm{var}(b))$ is positive
semidefinite. For this purpose, show first that
$X'X - X'P_Z X$ is positive semidefinite.

e. Prove that the F-test on linear restrictions can be
computed by (5.78). It is helpful to prove first
that $e_R'e_R - e_{RIV}'P_Z e_{RIV} = y'y - y'P_Z y$ and also
that $e'e - e_{IV}'P_Z e_{IV} = y'y - y'P_Z y$.

5.17* (Section 5.7.3)

This exercise is concerned with the derivation of the
Hausman test for the null hypothesis that all
regressors are exogenous against the alternative that the
last $k_0$ regressors are endogenous. We use the
notation of Section 5.7.3, and we write $v_i$ for the $k_0 \times 1$
vector with elements $v_{ji}$, $X_e$ for the $n \times k_0$ matrix of
possibly endogenous regressors, $x_{ei}'$ for the $i$th row
of $X_e$, $\alpha$ for the $k_0 \times 1$ vector with elements $\alpha_j$, and $\Gamma$
for the $k_0 \times m$ matrix with rows $\gamma_j'$. In the equations
$x_{ei} = \Gamma z_i + v_i$ it is assumed that $v_i \sim \mathrm{NID}(0, \Omega)$,
where $\Omega$ is the $k_0 \times k_0$ covariance matrix of $v_i$, and
in the equation $\varepsilon_i = v_i'\alpha + w_i$ (where $v_i'\alpha = E[\varepsilon_i | v_i]$) it
is assumed that $w_i \sim \mathrm{NID}(0, \sigma^2)$. The null
hypothesis of exogeneity is that $\alpha = 0$, in which case
$\varepsilon_i = w_i \sim \mathrm{NID}(0, \sigma^2)$. With this notation, the
model (5.82) can be written as

$$y_i = x_i'\beta + \varepsilon_i = x_i'\beta + v_i'\alpha + w_i = x_i'\beta + (x_{ei} - \Gamma z_i)'\alpha + w_i.$$

This model is non-linear in the parameters
$\theta = (\alpha, \beta, \Gamma, \sigma^2, \Omega^{-1})$, because of the product term
$\Gamma'\alpha$.

a.  Show  that  the  log-likelihood  is  given  by 

/(0)  =  -«log(27t)  +  flogdet(n~1)-I]r”=1QLr1iy 

b.  Show  that  the  ML  estimators  obtained  under  the 

null  hypothesis  that  a  =  0,  are  given  by  y3  =  b, 
f'  =  (Z'Z)~1Z'Xej  a1  =  e'e/n,  and  Cl  =  £yt>', 

where  v,  =  xei  -Tzj. 

c.  Show  that  the  score  vector  dl/d9  of  the  unre¬ 
stricted  log-likelihood,  evaluated  at  the  estimates 
of  b,  is  zero,  with  the  exception  of  dl/da,  which 
is  equal  to  dj  V'e. 

d. The Hessian matrix $\left(-\frac{\partial^2 l}{\partial\theta\,\partial\theta'}\right)$
is a $5 \times 5$ block matrix with blocks $B_{rs}$. Let sub-index 1 indicate
the blocks related to $\alpha$ and sub-index 2 the blocks related to $\beta$,
then show that $B_{11} = \frac{1}{\hat\sigma^2}\hat V'\hat V$, that
$B_{12} = B_{21}' = \frac{1}{\hat\sigma^2}\hat V'X$, and that
$B_{22} = \frac{1}{\hat\sigma^2}X'X$. Further the following approximations
may be used in e, but these results need not be proved: $B_{rs} = 0$ for all
$(r, s)$ with $r = 1, 2$ and $s = 3, 4, 5$, and also for all $(r, s)$ with
$r = 3, 4, 5$ and $s = 1, 2$.

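e. Use the results in c and d to prove that the LM-test computed according to
$LM = \left(\frac{\partial l}{\partial\theta}\right)'
\left(-\frac{\partial^2 l}{\partial\theta\,\partial\theta'}\right)^{-1}
\left(\frac{\partial l}{\partial\theta}\right)$
can be written as $LM = \frac{1}{\hat\sigma^2}\, e'U(U'U)^{-1}U'e$, where $U$
is the $n \times (k_0 + k)$ matrix $U = (\hat V \;\; X)$.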

f. Prove that e implies that $LM = nR^2$ of the regression in (5.83).

5.18 (Section 5.7.1)

Consider the following model for the relation between macroeconomic
consumption ($C$), disposable income ($D$), and non-consumptive expenditures
($Z$): $C_i = \alpha + \beta D_i + \varepsilon_i$ (the consumption equation)
and $D_i = C_i + Z_i$ (the income equation). Here $Z$ is assumed to be
exogenous in the sense that $E[Z_i\varepsilon_i] = 0$ for all
$i = 1, \ldots, n$.

a. Prove that the application of OLS in the consumption equation gives an
inconsistent estimator of the parameter $\beta$.

b.  Give  a  graphical  illustration  of  the  result  in  a  by 
drawing  a  scatter  plot  of  C  against  D,  and  use 
this  graph  to  explain  why  OLS  is  not  consistent. 

c.  Consider  two  cases  in  b,  one  where  Z  does  not 
vary  at  all  and  another  where  Z  has  a  very  large 
variance. 

d. Derive an explicit expression for the IV estimator of $\beta$ in terms of
the observed variables $C$, $D$, and $Z$.

e.  Use  the  expression  of  d  to  prove  that  this  IV 
estimator  is  consistent. 
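
The contrast between OLS and IV in this exercise can be explored numerically before it is proved. The sketch below is an illustration under assumed values, not part of the exercise: it takes $\alpha = 10$, $\beta = 0.8$, unit variances for $Z$ and $\varepsilon$, and normal distributions, none of which come from the text. The helper names (`simulate_consumption`, `ols_slope`, `iv_slope`) are hypothetical. Data are generated from the reduced form $D_i = (\alpha + Z_i + \varepsilon_i)/(1 - \beta)$, which follows from substituting the income equation into the consumption equation.

```python
import random

def simulate_consumption(n, alpha=10.0, beta=0.8, seed=0):
    """Simulate C_i = alpha + beta*D_i + eps_i with D_i = C_i + Z_i.

    All parameter values and distributions are assumptions for illustration.
    """
    rng = random.Random(seed)
    C, D, Z = [], [], []
    for _ in range(n):
        z = rng.gauss(5.0, 1.0)      # exogenous non-consumptive expenditures
        eps = rng.gauss(0.0, 1.0)    # disturbance, drawn independently of z
        d = (alpha + z + eps) / (1.0 - beta)  # reduced form for income
        c = d - z                             # income identity D = C + Z
        C.append(c)
        D.append(d)
        Z.append(z)
    return C, D, Z

def cov(a, b):
    """Sample covariance (divisor n)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / len(a)

def ols_slope(C, D):
    """OLS slope of C on a constant and D."""
    return cov(D, C) / cov(D, D)

def iv_slope(C, D, Z):
    """IV slope of C on D, using Z (and a constant) as instruments."""
    return cov(Z, C) / cov(Z, D)

C, D, Z = simulate_consumption(20000)
print("OLS slope:", round(ols_slope(C, D), 3))
print("IV slope: ", round(iv_slope(C, D, Z), 3))
```

Under these assumed values the OLS slope settles near $\beta + (1-\beta)\sigma_\varepsilon^2/(\sigma_Z^2 + \sigma_\varepsilon^2) = 0.9$ rather than 0.8, while the IV slope stays close to 0.8. Changing the standard deviation of `z` gives a feel for the two cases considered in part c.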


EMPIRICAL  AND  SIMULATION  QUESTIONS 

5.19 (Section 5.3.3)

a. Generate a sample of size 100 from the model
$y_i = 2 + \sqrt{x_i} + \varepsilon_i$ where the $x_i$ are independent and
uniformly distributed on the interval $[0, 20]$ and the $\varepsilon_i$ are
independent and distributed as $N(0, 0.01)$.

b.  Regress  y  on  a  constant  and  x.  Perform  a  RESET 
test  and  a  Chow  forecast  test.  Analyse  the  recur¬ 
sive  residuals  and  the  CUSUM  and  CUSUMSQ 
plots. 


c.  Answer  the  same  questions  after  the  data  have 
been  ordered  with  increasing  values  of  x. 

d. Estimate the model $y = \alpha + \beta x(\lambda) + \varepsilon$ by ML,
where $x(\lambda) = (x^{\lambda} - 1)/\lambda$.

e. For the estimated value of $\lambda$, regress $y$ on a constant and
$x(\lambda)$ and analyse the corresponding recursive residuals and CUSUM and
CUSUMSQ plots. Perform also a RESET test and a Chow forecast test.

5.20 (Sections 5.4.2, 5.4.3)

Simulate $n = 100$ data points as follows. Let $x_i$ consist of 100 random
drawings from the standard normal distribution, let $\eta_i$ be a random
drawing from the distribution $N(0, x_i^2)$, and let $y_i = x_i + \eta_i$.
We will estimate the model $y_i = \beta x_i + \varepsilon_i$.

a. Estimate $\beta$ by OLS. Estimate the standard error of $b$ both in the
conventional way and by White's method.

b. Estimate $\beta$ by WLS using the knowledge that
$\sigma_i^2 = \sigma^2 x_i^2$. Compare the estimate and the standard error
obtained for this WLS estimator with the results for OLS in a.

c. Now estimate $\beta$ by WLS using the (incorrect) heteroskedasticity
model $\sigma_i^2 = \sigma^2 / x_i^2$. Compute
the  standard  error  of  this  estimate  in  three 
ways  —  that  is,  by  the  WLS  expression  corres¬ 
ponding  to  this  (incorrect)  model,  by  the  White 
method  for  OLS  on  the  (incorrectly)  weighted 
data,  and  also  by  deriving  the  correct  formula 
for  the  standard  deviation  of  WLS  with  this  in¬ 
correct  model  for  the  variances. 

d. Perform 1000 simulations, where the $n = 100$ values of $x_i$ remain the
same over all simulations but the 100 values of $\eta_i$ are different
drawings from the $N(0, x_i^2)$ distributions and where the values of
$y_i = x_i + \eta_i$ differ accordingly between the simulations. Determine
the sample standard deviations over the 1000 simulations of the three
estimators of $\beta$ in a, b, and c, that is, OLS, WLS
(with  correct  weights),  and  WLS  (with  incorrect 
weights). 

e.  Compare  the  three  sample  standard  deviations 
in  d  with  the  estimated  standard  errors  in  a,  b, 
and  c,  and  comment  on  the  outcomes.  Which 
standard  errors  are  reliable,  and  which  ones  are 
not? 
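
As a possible starting point for parts a and b, the sketch below generates one data set and computes the OLS estimate with conventional and White standard errors, plus the WLS estimate with the correct weights. It uses only the Python standard library; the function names are hypothetical, and the White formula used is the version for a regression without constant term, $\widehat{\mathrm{var}}(b) = \sum x_i^2 e_i^2 / (\sum x_i^2)^2$. Part c and the 1000 replications of part d are left out.

```python
import math
import random

def simulate(n, rng):
    """Draw x_i ~ N(0,1) and y_i = x_i + eta_i with eta_i ~ N(0, x_i^2)."""
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [xi + rng.gauss(0.0, abs(xi)) for xi in x]  # sd of eta_i is |x_i|
    return x, y

def ols_with_ses(x, y):
    """OLS in y_i = beta*x_i + eps_i (no constant), with two standard errors."""
    n = len(x)
    sxx = sum(xi * xi for xi in x)
    b = sum(xi * yi for xi, yi in zip(x, y)) / sxx
    e = [yi - b * xi for xi, yi in zip(x, y)]
    s2 = sum(ei * ei for ei in e) / (n - 1)
    se_conventional = math.sqrt(s2 / sxx)
    se_white = math.sqrt(sum((xi * ei) ** 2 for xi, ei in zip(x, e))) / sxx
    return b, se_conventional, se_white

def wls_correct_weights(x, y):
    """WLS under sigma_i^2 = sigma^2 * x_i^2: divide the model by x_i.

    The transformed model is y_i/x_i = beta + eps_i/x_i, so the estimator
    is the sample mean of y_i/x_i.
    """
    ratios = [yi / xi for xi, yi in zip(x, y)]
    n = len(ratios)
    b = sum(ratios) / n
    s2 = sum((r - b) ** 2 for r in ratios) / (n - 1)
    return b, math.sqrt(s2 / n)

x, y = simulate(100, random.Random(1))
print("OLS (b, conventional se, White se):", ols_with_ses(x, y))
print("WLS (b, se):", wls_correct_weights(x, y))
```

For part d, one option is to keep `x` fixed, redraw only the disturbances inside a loop, and store the estimates from each replication.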

5.21 (Section 5.5.3)

a. Generate a sample of size $n = 100$ from the model
$y_i = 2 + \sqrt{x_i} + \varepsilon_i$, where the $x_i$ are independent and
uniformly distributed on the interval $[0, 20]$ and the $\varepsilon_i$ are
independent and distributed as $N(0, 0.01)$. Regress $y$ on a constant and
$x$ and apply tests on serial correlation.

b.  Sort  the  data  of  a  with  increasing  values  of  x. 
Again  regress  y  on  a  constant  and  x  and  apply 
tests  on  serial  correlation.  Save  the  residual  series 
for  later  use  in  e. 


c. Generate a sample of size $n = 100$ from the linear model
$y_i = 2 + x_i + \varepsilon_i$, where the $x_i$ are independent and
uniformly distributed on the interval $[0, 20]$ and the $\varepsilon_i$ are
independent and distributed as $N(0, 0.01)$. Regress $y$ on a constant
and  x  and  apply  tests  on  serial  correlation. 

d.  Sort  the  data  of  c  with  increasing  values  of  the 
residuals  e.  Again  regress  y  on  a  constant  and  x 
and  apply  tests  on  serial  correlation. 

e.  Explain  the  results  in  b  and  d  by  considering 
relevant  scatter  diagrams. 

f.  Discuss  the  relevance  of  your  findings  for  the 
interpretation  of  serial  correlation  tests  (like 
Durbin-Watson)  for  cross  section  data. 
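
Parts a and b can be tried out with a few lines of code. The sketch below is one possible set-up using only the Python standard library, with hypothetical function names. It generates the data of part a, computes the Durbin-Watson statistic $\sum_{i=2}^{n}(e_i - e_{i-1})^2 / \sum_{i=1}^{n} e_i^2$ from the OLS residuals, and repeats this after sorting the observations by $x$. Other serial correlation tests and parts c and d are not covered.

```python
import math
import random

def ols_residuals(x, y):
    """Residuals of the OLS regression of y on a constant and x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    a = my - b * mx
    return [yi - a - b * xi for xi, yi in zip(x, y)]

def durbin_watson(e):
    """Durbin-Watson statistic for a residual series in its given order."""
    num = sum((e[i] - e[i - 1]) ** 2 for i in range(1, len(e)))
    return num / sum(ei * ei for ei in e)

rng = random.Random(0)
x = [rng.uniform(0.0, 20.0) for _ in range(100)]
# N(0, 0.01) has standard deviation 0.1
y = [2.0 + math.sqrt(xi) + rng.gauss(0.0, 0.1) for xi in x]

dw_unsorted = durbin_watson(ols_residuals(x, y))

pairs = sorted(zip(x, y))            # order the observations by x
xs = [p[0] for p in pairs]
ys = [p[1] for p in pairs]
dw_sorted = durbin_watson(ols_residuals(xs, ys))

print("DW, original order:", round(dw_unsorted, 2))
print("DW, sorted by x:   ", round(dw_sorted, 2))
```

The OLS coefficients are identical in both runs; only the ordering of the residuals changes, which is what the Durbin-Watson statistic responds to.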

5.22* (Section 5.5.2)

In this exercise we simulate data with the model
$y_i = \beta x_i + \varepsilon_i$, $i = 1, \ldots, n$, where $x_i$ and
$\varepsilon_i$ are both generated by an AR(1) model. That is,
$x_i = \rho x_{i-1} + \omega_i$ (with $x_0 = 0$) and
$\varepsilon_i = \gamma\varepsilon_{i-1} + \eta_i$ (with $\varepsilon_0 = 0$),
where $\omega_i$ and $\eta_i$ are both NID(0,1) with $\omega_i$ and $\eta_j$
independent for all $i, j$. The parameters $\rho$ and $\gamma$ satisfy
$-1 < \rho < 1$ and $-1 < \gamma < 1$. The OLS estimator of $\beta$ is given
by $b = \sum_{i=1}^{n} x_i y_i / \sum_{i=1}^{n} x_i^2$, and the conventional
OLS formula for the variance is
$\widehat{\mathrm{var}}(b) = s^2 / \sum_{i=1}^{n} x_i^2$ where
$s^2 = \sum_{i=1}^{n} (y_i - b x_i)^2 / (n - 1)$.

a. Prove that, for $i \to \infty$, the correlation between $x_i$ and
$x_{i-k}$ converges to $\rho^k$ and the correlation between $\varepsilon_i$
and $\varepsilon_{i-k}$ to $\gamma^k$. Prove also that, for $i \to \infty$,
the variance of $x_i$ converges to $1/(1 - \rho^2)$ and the variance of
$\varepsilon_i$ to $1/(1 - \gamma^2)$.

b. Prove that, although the regressors are stochastic here, the OLS
estimator $b$ is unbiased and consistent in this case.

c. Prove that, for $n \to \infty$, the true variance of $b$ is not given by
the OLS formula $\widehat{\mathrm{var}}(b) = s^2 / \sum_{i=1}^{n} x_i^2$, but
that it is approximately equal to
$\widehat{\mathrm{var}}(b)\,\frac{1 + \rho\gamma}{1 - \rho\gamma}$. Use the
fact that $s^2$ is a consistent estimator of the variance
$1/(1 - \gamma^2)$ of the disturbances $\varepsilon_i$.

d. Simulate two data sets of size $n = 100$, one in the model with
$\beta = 0$ and the other one in the model with $\beta = 1$. For both
simulations, take $\rho = \gamma = 0.7$. For both data sets, regress $y$ on
$x$ and compute the OLS standard error of $b$ and also the HAC standard
error of $b$. For the data generated with $\beta = 0$, test the null
hypothesis that $\beta = 0$ against the (two-sided) alternative that
$\beta \neq 0$ (at 5% significance), using the two $t$-values obtained by
the OLS and the HAC standard errors of $b$.

e. Repeat the simulation of d 1000 times. For the model with $\beta = 0$,
compute the frequency of rejection of the null hypothesis that $\beta = 0$
for the $t$-tests based on the OLS and the HAC standard errors of $b$.

f.  For  each  of  the  two  data  generating  processes, 
compute  the  standard  deviation  of  the  estimates 
b  over  the  1000  simulations  and  also  the  mean  of 
the  1000  reported  OLS  standard  errors  and  of  the 
1000  reported  HAC  standard  errors.  Compare 
these  values  and  relate  them  to  the  outcomes  in  e. 

g.  Relate  the  outcomes  in  f  also  to  the  result 
obtained  in  c. 

h.  Comment  on  the  relevance  of  your  findings  for 
significance  tests  of  regression  coefficients  if 
serial  correlation  is  neglected. 
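
The simulation part of this exercise can be organised along the lines of the sketch below. It is a partial illustration only, with hypothetical function names and only the Python standard library. It compares the Monte Carlo standard deviation of $b$ with the average conventional OLS standard error for $\beta = 0$ and $\rho = \gamma = 0.7$. HAC standard errors are deliberately left out, so parts d to f are not fully covered. The benchmark printed at the end, $\sqrt{(1 + \rho\gamma)/(1 - \rho\gamma)}$, is the standard large-sample ratio of the true to the conventional standard error in this set-up.

```python
import math
import random

def simulate_ar1(n, coef, rng):
    """AR(1) series started at zero with NID(0,1) innovations."""
    series, prev = [], 0.0
    for _ in range(n):
        prev = coef * prev + rng.gauss(0.0, 1.0)
        series.append(prev)
    return series

def one_replication(n, beta, rho, gamma, rng):
    """Return the OLS estimate b and its conventional standard error."""
    x = simulate_ar1(n, rho, rng)
    eps = simulate_ar1(n, gamma, rng)
    y = [beta * xi + ei for xi, ei in zip(x, eps)]
    sxx = sum(xi * xi for xi in x)
    b = sum(xi * yi for xi, yi in zip(x, y)) / sxx
    s2 = sum((yi - b * xi) ** 2 for xi, yi in zip(x, y)) / (n - 1)
    return b, math.sqrt(s2 / sxx)

def monte_carlo(reps=1000, n=100, beta=0.0, rho=0.7, gamma=0.7, seed=0):
    """Monte Carlo sd of b and the mean of the reported OLS standard errors."""
    rng = random.Random(seed)
    results = [one_replication(n, beta, rho, gamma, rng) for _ in range(reps)]
    bs = [r[0] for r in results]
    mean_b = sum(bs) / reps
    sd_b = math.sqrt(sum((b - mean_b) ** 2 for b in bs) / (reps - 1))
    mean_se = sum(r[1] for r in results) / reps
    return sd_b, mean_se

sd_b, mean_se = monte_carlo()
print("sd of b over simulations:", round(sd_b, 4))
print("mean OLS standard error: ", round(mean_se, 4))
print("large-sample ratio:      ", round(math.sqrt((1 + 0.49) / (1 - 0.49)), 3))
```

A HAC standard error would have to be added to `one_replication` to compare rejection frequencies as asked in part e.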

5.23 (Sections 5.7.1, 5.7.3)

In this exercise we consider simulated data on the relation between police
($x$) and crime ($y$). Some of the data refer to election years ($z = 1$),
the other data to non-election years ($z = 0$). We want to estimate the
effect of police on crime, that is, the parameter $\beta$ in the model
$y_i = \alpha + \beta x_i + \varepsilon_i$.

a.  Regress  ‘crime’  on  a  constant  and  ‘police’.  Give  a 
possible  explanation  of  the  estimated  positive 
effect. 

b.  Give  a  verbal  motivation  why  the  election 
dummy  z  could  serve  as  an  instrument. 

c. Show that the IV estimator of $\beta$ is given by
$(\bar y_1 - \bar y_0)/(\bar x_1 - \bar x_0)$, where $\bar y_1$ denotes the
sample mean of $y$ over election years and $\bar y_0$ over non-election
years and where $\bar x_1$ and $\bar x_0$ are defined in a similar way. Give
also an intuitive motivation for this estimator of $\beta$.

d. Use the data to estimate $\beta$ by instrumental variables, using $z$
(and a constant) as instruments.
Check  that  the  result  of  c  holds  true.  Give  an 
interpretation  of  the  resulting  estimate. 

e.  Perform  the  Hausman  test  on  the  exogeneity  of 
the  variable  x. 
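
The equality in part c can be checked numerically before proving it. The sketch below does not use the data file of this exercise. It generates its own artificial data under assumed values (true $\beta = -1$, more police in election years, and a disturbance that is positively correlated with police), and all names are hypothetical. It then computes the IV estimator from the usual covariance ratio and from the difference in group means.

```python
import random

def mean(v):
    return sum(v) / len(v)

def ols_slope(y, x):
    """OLS slope of y on a constant and x."""
    mx, my = mean(x), mean(y)
    return (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
            / sum((xi - mx) ** 2 for xi in x))

def iv_slope(y, x, z):
    """IV slope with instruments (1, z): ratio of sample covariances."""
    mz, mx, my = mean(z), mean(x), mean(y)
    num = sum((zi - mz) * (yi - my) for zi, yi in zip(z, y))
    den = sum((zi - mz) * (xi - mx) for zi, xi in zip(z, x))
    return num / den

def group_mean_slope(y, x, z):
    """(ybar_1 - ybar_0) / (xbar_1 - xbar_0) for a binary instrument z."""
    y1 = mean([yi for yi, zi in zip(y, z) if zi == 1])
    y0 = mean([yi for yi, zi in zip(y, z) if zi == 0])
    x1 = mean([xi for xi, zi in zip(x, z) if zi == 1])
    x0 = mean([xi for xi, zi in zip(x, z) if zi == 0])
    return (y1 - y0) / (x1 - x0)

# Artificial data; every number below is an assumption for illustration.
rng = random.Random(0)
n = 200
z = [1 if i % 4 == 0 else 0 for i in range(n)]        # election-year dummy
u = [rng.gauss(0.0, 1.0) for _ in range(n)]           # unobserved factor
x = [10.0 + 2.0 * zi + ui for zi, ui in zip(z, u)]    # police
y = [50.0 - 1.0 * xi + 3.0 * ui + rng.gauss(0.0, 1.0)
     for xi, ui in zip(x, u)]                          # crime

print("OLS slope:       ", round(ols_slope(y, x), 3))
print("IV slope:        ", round(iv_slope(y, x, z), 3))
print("group-mean slope:", round(group_mean_slope(y, x, z), 3))
```

The last two numbers coincide up to rounding error, whatever the data, because $\sum (z_i - \bar z)(y_i - \bar y)$ equals $(n_1 n_0/n)(\bar y_1 - \bar y_0)$ for a binary $z$, and likewise for $x$.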


5.24 (Section 5.3.3)

Consider the data of Example 5.9 (ordered with education), which showed a
break at observation 366 (education at least 16 years) in the marginal
effect $\beta$ of education on salaries (see Exhibit 5.15).

a.  Check  the  outcomes  on  a  break  (at  observation 
425  for  the  Chow  tests)  discussed  in  Example  5.9. 

b. Formulate a model with two different values of $\beta$ in (5.16), one for
education levels less than 16 years (observations $i \le 365$) and another
for education levels of 16 years or more (observations $i \ge 366$).
Estimate this model, and give an interpretation of the outcomes.

c.  Perform  Chow  break  tests  and  Chow  forecast  tests 
(with  the  break  now  located  at  observation  366). 

d.  Perform  a  sequence  of  Chow  break  tests  for  all 
segments  where  the  variable  ‘education’  changes. 
This  variable  takes  on  ten  different  values,  so 
that  there  are  nine  possible  break  points.  Com¬ 
ment  on  the  outcomes. 

e.  Perform  also  a  sequence  of  Chow  forecast  tests 
and  give  an  interpretation  of  the  outcomes. 

5.25 (Sections 5.4.4, 5.4.5)

Consider the salary data of Example 5.15 with the regression model
discussed in that example. In this exercise we adjust the model for the
variances as follows:
$E[\varepsilon_i^2] = \gamma_1 + \gamma_2 D_{2i} + \gamma_3 D_{3i} + \gamma_4 x_i + \gamma_5 x_i^2$,
that is, the model for the variances is additive and contains also effects
of the level of education.

a.  Estimate  the  eleven  parameters  (six  regression 
parameters  and  five  variance  parameters)  by 
(two-step)  FWLS  and  compare  the  outcomes 
with  the  results  in  Exhibit  5.22. 

b. Check that the data in the data file are sorted with increasing values of
$x_i$. Inspect the histogram of $x_i$ and choose two subsamples to perform
the Goldfeld-Quandt test on possible heteroskedasticity due to the variable
$x_i$.

c.  Perform  the  Breusch-Pagan  test  on  heteroskedas¬ 
ticity,  using  the  specified  model  for  the  variances. 

d.  Also  perform  the  White  test  on  heteroskedasticity. 

e.  Comment  on  the  similarities  and  differences  be¬ 
tween  the  test  outcomes  in  b-d. 

5.26 (Section 5.3.1)

In this exercise we consider data on weekly coffee sales (for brand 1). In
total there are $n = 18$ weekly observations, namely six weeks without any
marketing actions, six weeks with price reductions without advertisement,
and six weeks with joint price reductions and advertisement. In Example 5.7
we considered similar coffee data, but only for the twelve weeks without
advertisement. Now we will shift the attention to another subset of the
data, and we restrict the attention to sales in the twelve weeks with
marketing actions. As there are no advertisements without simultaneous
price reductions, we formulate the model

$$y = \beta_1 + \beta_2 D_p + \beta_3 D_a + \beta_4 D_p D_a + \varepsilon,$$

where $y$ denotes the logarithm of weekly sales, $D_p$ is a dummy variable
with the value 0 if the price reduction is 5% and the value 1 if this
reduction is 15%, and $D_a$ is a dummy variable that is 0 if there is no
advertisement and 1 if there is advertisement.

a. Give an economic motivation for the above model. Estimate this model and
test the null hypothesis that $\beta_2 = 0$. What is the P-value of this
test?

b. Estimate the above model, replacing $D_a$ by the alternative dummy
variable $D_a^*$, which has the value 0 if there is advertisement and 1 if
there is not. The model then becomes
$y = \beta_1^* + \beta_2^* D_p + \beta_3^* D_a^* + \beta_4^* D_p D_a^* + \varepsilon$.
Compare the estimated price coefficient and its $t$-value and P-value with
the results obtained in a.

c. Explain why the two results for the price dummy differ in a and b.
Discuss the relevance of this fact for the interpretation of coefficients of
dummy variables in regression models.

d. Derive the four relations between the parameters $\beta_i$ and
$\beta_i^*$, $i = 1, \ldots, 4$, in the two models. Check that the two sets
of regression parameters satisfy the same relations. Relate this result
to c.

5.27 (Sections 5.5.1, 5.5.4)

In this exercise we consider the budget data of Example 5.20, ordered in
segments as discussed in Example 5.20. We consider both the linear model of
Section 5.5.1 and the non-linear model of Section 5.5.4 for the relation
between the fraction of expenditures spent on food ($y$), the total
consumption expenditures ($x_2$, in \$10,000 per year), and the average
household size ($x_3$).

a. Apply OLS in the linear model and perform a RESET.

b. Apply recursive least squares in the linear model and perform a CUSUM
test.

c. Investigate whether the disturbances in the non-linear model are
heteroskedastic. In particular, investigate whether the disturbance
variance is related to the group size.

d. Discuss how the non-linear model can be estimated in case of
heteroskedasticity related to the group size. Estimate this model and
compare the outcomes (especially the regression coefficients and the
standard errors) with the results in Example 5.25 (see Exhibit 5.34).


5.28 (Sections 5.4.4, 5.6.3)

In this exercise we consider monthly data of the three-month Treasury Bill
rate ($r_i$) in the USA from January 1985 to December 1999. In Example 5.11
we considered the monthly changes $x_i = r_i - r_{i-1}$. We consider the
following simple model for the relation of these changes to the level of
this interest rate:
$r_i - r_{i-1} = \alpha + \beta r_{i-1} + \varepsilon_i$. In financial
economics, several models are proposed for the variance of the unpredicted
changes $\varepsilon_i$. We consider models of the form
$E[\varepsilon_i^2] = \sigma^2 r_{i-1}^{2\gamma}$, so that the vector of
unknown parameters is given by
$\theta = (\alpha, \beta, \gamma, \sigma^2)'$. The Vasicek model postulates
that $\gamma = 0$, the Cox-Ingersoll-Ross model that $\gamma = 1/2$, and the
Brennan-Schwartz model that $\gamma = 1$.

a. Estimate the four parameters in $\theta$ by (two-step) FWLS.

b. Estimate $\theta$ by maximum likelihood, assuming that the error terms
$\varepsilon_i$ are normally distributed. Compare the estimates with the
ones obtained in a.

c. Test the three hypotheses that $\gamma = 0$, $\gamma = 1/2$, and
$\gamma = 1$, both by the Wald test and by the Likelihood Ratio test. What
is your conclusion?

d. Test the hypothesis of normally distributed error terms $\varepsilon_i$
by means of the ML residuals of b. What is your conclusion?

5.29 (Sections 5.5.4, 5.6.2, 5.6.4)

In this exercise we consider the quarterly series of industrial production
($y_i$, in logarithms) for the USA over the period 1950.1-1998.3. These
data were discussed in Example 5.26 (see Exhibit 5.36 to get an idea of this
series).

a. Estimate the linear trend model $y_i = \alpha + \beta i + \varepsilon_i$
and test whether the slope $\beta$ is constant over the sample.

b. Now include seasonal dummies to account for possible seasonal effects.
Test for the individual and joint significance of the seasonal dummies.

c.  Investigate  the  presence  of  outliers  in  this  model. 

d. Let $\Delta_4 y_i = y_i - y_{i-4}$ be the yearly growth rate; then
estimate the model $\Delta_4 y_i = \mu + \varepsilon_i$ by OLS.
What  are  the  leverages  in  this  model?  Investigate 
the  presence  of  outliers  in  this  model. 

e.  How  would  you  estimate  the  yearly  growth  rate 
of  industrial  production,  by  the  sample  mean,  by 
the  median,  or  in  another  way?  Motivate  your 
answer. 

5.30 (Sections 5.5.3, 5.6.2)

In  Example  5.27  we  considered  the 
CAPM  for  the  sector  of  cyclical  consumer 
goods.  In  addition  we  now  also  consider 
the  sector  of  non-cyclical  consumer  goods. 

a.  Perform  tests  for  heteroskedasticity  and  serial 
correlation  in  the  CAPM  for  the  sector  of  cyclical 
consumer  goods. 

b.  Answer  a  also  for  the  sector  of  non-cyclical  con¬ 
sumer  goods. 

c.  Investigate  the  presence  and  nature  of  influential 
observations  in  the  CAPM  for  the  sector  of  non- 
cyclical  consumer  goods. 

d.  Discuss  the  relevance  of  the  possible  presence  of 
heteroskedasticity  and  serial  correlation  on  the 
detection  of  influential  observations. 

5.31 (Sections 5.3.3, 5.4.5, 5.5.3, 5.6.2, 5.6.3)

In Example 5.31 we considered data on gasoline consumption (GC), price of
gasoline (PG), and real income (RI) over the years 1970-99. In all tests
below use a significance level of 5%.

a. Estimate the model
$GC_i = \alpha + \beta PG_i + \gamma RI_i + \varepsilon_i$, using the data
over the period 1970-95.

b.  Perform  a  test  on  parameter  constancy  over  this 
period. 

c.  Perform  a  test  for  heteroskedasticity  over  this 
period. 

d.  Perform  a  test  for  serial  correlation  over  this 
period. 

e.  Perform  a  test  on  outliers  and  a  test  on  normality 
of  the  disturbances  over  this  period. 


f.  Perform  a  Chow  forecast  test  for  the  quality  of 
the  model  in  a  in  forecasting  the  gasoline  con¬ 
sumption  in  the  years  1996-99,  for  given  values 
of  the  explanatory  variables  over  this  period. 

5.32 (Section 5.8)

In Section 5.8 we considered the relation between the salary of top managers
and the profits of firms for the 100 largest firms in the Netherlands in
1999. We postulated the model $y_i = \alpha + \beta x_i + \varepsilon_i$,
where $y_i$ is the average salary of top managers of firm $i$ and $x_i$ is
the profit of firm $i$ (both in logarithms).

a.  Discuss  whether  you  find  the  seven  standard  as¬ 
sumptions  of  the  regression  model  intuitively 
plausible. 

b.  Check  the  results  of  diagnostic  tests  reported  in 
Exhibit  5.49  (for  the  sample  of  n  =  96  firms  with 
positive  profits). 

c.  When  the  model  is  estimated  for  the  forty-eight 
firms  with  the  smallest  (positive)  profits,  then  no 
significant  relation  is  found.  Check  this,  and  dis¬ 
cuss  the  importance  of  this  finding  for  a  top 
manager  of  a  firm  with  small  profits  who  wishes 
to  predict  his  or  her  salary. 

5.33 (Sections 5.3.3, 5.4.3, 5.6.2)

In this exercise we consider data on the US presidential election in 2000.
The data file contains the number of votes on the different candidates in
the $n = 67$ counties of the state Florida, before recounting. The county
Palm Beach is observation number $i = 50$. The recounts in Florida were
motivated in part by possible mistakes of voters in Palm Beach who wanted
to vote for Gore (the second candidate, but third punch hole on the ballot
paper) but by accident first selected Buchanan (second punch hole on the
ballot paper). This resulted in ballot papers with multiple punch holes.
The difference (before recounts) between Bush and Gore in the state Florida
was 975 votes in favour of Bush.

a.  Perform  a  regression  of  the  number  of  votes  on 
Buchanan  on  a  constant  and  the  number  of  votes 
on  Gore.  Investigate  for  the  presence  of  outliers. 

b. Estimate the number of votes $v$ in Palm Beach county that are
accidentally given to Buchanan by including a dummy variable for this
county in the regression model of a. Test the hypothesis that $v < 975$
against the alternative that $v > 975$.

c. The counties differ in size so that the error terms in the regression in
a may be heteroskedastic. Perform the Breusch-Pagan test on
heteroskedasticity of the form
$\sigma_i^2 = h(\gamma_1 + \gamma_2 n_i)$, where $n_i$ denotes the total
number of votes on all candidates in county $i$.

d. Answer b and c also for the model where the fraction of votes (instead of
the number of votes) on Buchanan in each county is explained in terms of the
fraction of votes on Gore in that county. For the Breusch-Pagan test
consider heteroskedasticity of the form
$\sigma_i^2 = h(\gamma_1 + \gamma_2 / n_i)$.

e.  Formulate  an  intuitively  plausible  model  for  the 
variance  of  the  disturbance  terms  in  the  regres¬ 
sion  model  of  a,  using  the  results  of  the  Breusch- 
Pagan  tests  in  c  and  d.  Answer  b  using  a  regres¬ 
sion  equation  with  appropriately  weighted  data. 

f.  Discuss  and  investigate  whether  the  assumptions 
that  are  needed  for  the  (politically  important) 
conclusion  of  e  are  plausible  for  these  data. 


6 


Qualitative  and  Limited 
Dependent  Variables 


In  this  chapter  we  consider  dependent  variables  with  a  restricted  domain  of 
possible  outcomes.  Binary  variables  have  only  two  possible  outcomes  (‘yes’ 
and  ‘no’);  other  qualitative  variables  can  have  more  than  two  but  a  finite 
number  of  possible  outcomes  (for  example,  the  choice  between  a  limited 
number  of  alternatives).  It  may  also  be  that  the  outcomes  of  the  dependent 
variable  are  restricted  to  an  interval.  For  instance,  for  individual  agents  the 
amount  of  money  spent  on  luxury  goods  or  the  duration  of  unemployment  is 
non-negative,  with  a  positive  probability  for  the  outcome  ‘zero’.  For  all  such 
types  of  dependent  variables,  the  linear  regression  model  with  normally 
distributed  error  terms  is  not  suitable.  We  discuss  probit  and  logit  models 
for  qualitative  data,  tobit  models  for  limited  dependent  variables,  and 
models  for  duration  data. 

Section 6.1 is the basic section of this chapter and it is required for the
material discussed in Sections 6.2 and 6.3. These last two sections can be
read independently from each other.

6.1 Binary response


6.1.1  Model  formulation 

Uses  Chapters  1-4;  Sections  5.4  and  5.6. 

Motivation 

Students may succeed in finishing their studies or they may drop out,
households may buy a trendy new product or not, and individuals may respond
to a direct mailing or not. In all such cases the variable of interest can
take only two possible values. Such variables are called binary. The two
outcomes will be labelled as 1 ('success') and 0 ('failure'). The simplest
statistical model to describe a binary variable $y$ is the Bernoulli
distribution with $P[y = 1] = p$ and $P[y = 0] = 1 - p$. However, it may
well be that the probability of success differs among individuals, and in
this section we are interested in modelling the possible causes of these
differences. For instance, the probability of success for students in their
studies will depend on their intelligence, the probability of buying a new
trendy product will depend on income and age, and the probability of a
response to a direct mailing will depend on relevant interests of the
individuals.

Assumptions  on  explanatory  variables 

As before, for individual $i$ the values of $k$ explanatory variables are
denoted by the $k \times 1$ vector $x_i$ and the outcome of the binary
dependent variable is denoted by $y_i$. We will always assume that the model
contains a constant term and that $x_{1i} = 1$ for all individuals.
Throughout this chapter we will treat the explanatory variables as fixed
values, in accordance with Assumption 1 in Section 3.1.4 (p. 125). However,
as was discussed in Section 4.1, in practice all data (both $y_i$ and $x_i$)
are often stochastic. This is the case, for instance, when the observations
are obtained by random sampling from an underlying population, and this is
the usual situation for the types of data considered in this chapter. All
the results of this chapter carry over to the case of exogenous stochastic
regressors, by interpreting the results conditional on the given outcomes of
$x_i$, $i = 1, \ldots, n$. This kind of interpretation was also discussed in
Section 4.1.2 (p. 191).

The linear probability model

For a binary dependent variable, the regression model

$$y_i = x_i'\beta + \varepsilon_i = \beta_1 + \sum_{j=2}^{k} \beta_j x_{ji} + \varepsilon_i,
\qquad E[\varepsilon_i] = 0 \qquad (6.1)$$

is called the linear probability model. As $E[\varepsilon_i] = 0$ and $y_i$
can take only the values zero and one, it follows that
$x_i'\beta = E[y_i] = 0 \cdot P[y_i = 0] + 1 \cdot P[y_i = 1]$, so that

$$P[y_i = 1] = E[y_i] = x_i'\beta. \qquad (6.2)$$

Note that we write $P[y_i = 1] = x_i'\beta$, that is, the subindex $i$ of
$y_i$ indicates that we deal with an individual with characteristics $x_i$.
This can be written more explicitly as $P[y_i = 1 | x_i]$, but for
simplicity of notation we delete the conditioning on $x_i$. Similar
shorthand notations will be used throughout this chapter. In the linear
probability model, $x_i'\beta$ measures the probability that an individual
with characteristics $x_i$ will make the choice $y_i = 1$, so that the
marginal effect of the $j$th explanatory variable is equal to

$$\partial P[y_i = 1]/\partial x_{ji} = \beta_j, \qquad j = 2, \ldots, k.$$

Disadvantages  of  the  linear  model 

The linear probability model has several disadvantages. It places implicit
restrictions on the parameters $\beta$, as (6.2) requires that
$0 \le x_i'\beta \le 1$ for all $i = 1, \ldots, n$. Further, the error terms
$\varepsilon_i$ are not normally distributed. This is because the variable
$y_i$ can take only the values zero and one, so that $\varepsilon_i$ is a
random variable with discrete distribution given by

$\varepsilon_i = 1 - x_i'\beta$ with probability $x_i'\beta$,
$\varepsilon_i = -x_i'\beta$ with probability $1 - x_i'\beta$.

The distribution of $\varepsilon_i$ depends on $x_i$ and has variance equal
to $\mathrm{var}(\varepsilon_i) = x_i'\beta(1 - x_i'\beta)$, so that the
error terms are heteroskedastic with variances that depend on $\beta$. The
assumption that $E[\varepsilon_i] = 0$ in (6.1) implies that OLS is an
unbiased estimator of $\beta$ (provided that the regressors are exogenous),
but clearly it is not efficient and the conventional OLS formulas for the
standard errors do not apply. Further, if the OLS estimates $b$ are used to
compute the estimated probabilities $\hat P[y_i = 1] = x_i'b$, then this may
give values smaller than zero or larger than one, in which case they are
not real 'probabilities'. This may occur because OLS neglects the implicit
restrictions $0 \le x_i'\beta \le 1$.

Non-linear model for probabilities

The probabilities can be confined to values between zero and one by using a
non-linear model. Let $F$ be a function with values ranging between zero and
one, and let

$$P[y_i = 1] = F(x_i'\beta). \qquad (6.3)$$

For the ease of interpretation of this model, the function $F$ is always
taken to be monotonically non-decreasing. In this case, if $\beta_j > 0$,
then an increase in $x_{ji}$ leads to an increase (or at least not to a
decrease) of the probability that $y_i = 1$. That is, positive (negative)
coefficients correspond to positive (negative) effects on the probability
of success. An obvious choice for the function $F$ is a cumulative
distribution function. This is illustrated in Exhibit 6.1.
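
The drawback of the linear probability model that motivates the non-linear model, fitted 'probabilities' outside the unit interval, is easy to reproduce. The sketch below is an illustration under assumptions that do not come from the text: a single regressor drawn uniformly from $[-4, 4]$, true success probabilities given by a logistic curve $1/(1 + e^{-1.5x})$, and a sample of 500. It fits the linear probability model by OLS and counts the fitted values that fall below zero or above one. The function name `fit_lpm` is hypothetical.

```python
import math
import random

def fit_lpm(x, y):
    """OLS of the binary outcome y on a constant and a single regressor x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    a = my - b * mx
    return a, b

# Artificial data; the design and the logistic curve are assumptions.
rng = random.Random(0)
n = 500
x = [rng.uniform(-4.0, 4.0) for _ in range(n)]
p_true = [1.0 / (1.0 + math.exp(-1.5 * xi)) for xi in x]
y = [1 if rng.random() < pi else 0 for pi in p_true]

a, b = fit_lpm(x, y)
fitted = [a + b * xi for xi in x]
n_outside = sum(1 for f in fitted if f < 0.0 or f > 1.0)

print("intercept and slope:", round(a, 3), round(b, 3))
print("fitted values outside [0, 1]:", n_outside, "of", n)
print("smallest and largest fitted value:",
      round(min(fitted), 3), round(max(fitted), 3))
```

Replacing the straight line by a function $F$ with values between zero and one, as in (6.3), removes this problem by construction.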

Marginal  effects  on  probabilities 

In the model (6.3) $x_i'\beta$ can be interpreted as the strength of the
stimulus for the outcome $y_i = 1$, with
$P[y_i = 1] = F(x_i'\beta) \to 1$ if $x_i'\beta \to \infty$ and
$P[y_i = 1] \to 0$ if $x_i'\beta \to -\infty$. Assuming that $F$ is
differentiable with derivative $f$ (the density function corresponding to
$F$), the marginal effect of the $j$th explanatory variable is given by

$$\frac{\partial P[y_i = 1]}{\partial x_{ji}} = f(x_i'\beta)\beta_j,
\qquad j = 2, \ldots, k. \qquad (6.4)$$

Exhibit 6.1 Probability Models

Binary dependent variable ($y$ takes value 0 or 1) with linear probability
model (a) and with non-linear probability model in terms of a cumulative
distribution function (b), for a single explanatory variable ($x$).

Equation (6.4) shows that the marginal effect of changes in the explanatory
variables depends on the level of these variables. Usually, the density
function $f$ has relatively smaller values in the tails and relatively
larger values near the mean, so that the effects are smallest for
individuals for which $P[y_i = 1]$ is near zero (in the left tail of $f$) or
near one (in the right tail of $f$). This conforms with the intuition that
individuals with clear-cut preferences are less affected by changes in the
explanatory variables. The sensitivity of decisions to changes in the
explanatory variables depends on the shape of the density function $f$. It
is usually assumed that this density has mean zero, which is no loss of
generality, because the explanatory variables include a constant term.
Further it is usually assumed that the density is unimodal and symmetric, so
that $f(t)$ is maximal for $t = 0$ and $f(t) = f(-t)$ for all $t$. Then the
marginal effects are maximal for values of $x_i'\beta$ around zero, where
$P[y_i = 1]$ is around 1/2.

Restriction  needed  for  parameter  identification 

The standard deviation of the density $f$ should be specified beforehand.
Indeed, if $g(t) = \sigma f(\sigma t)$, then the cumulative distribution functions
($G$ of $g$ and $F$ of $f$) are related by $G(t) = F(\sigma t)$, so that
$P[y_i = 1] = F(x_i'\beta) = G(x_i'\beta/\sigma)$. That is, the model (6.3) with function $F$ and
parameter vector $\beta$ is equivalent to the model with function $G$ and parameter
vector $\beta/\sigma$. So the variance of the distribution $f$ should be fixed, independent
of the data, as otherwise the parameter vector $\beta$ is not identified.
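
The identification argument can be checked numerically. The following is a minimal sketch, assuming a logistic $F$ and purely illustrative values for the parameter vector, the regressors, and the scale factor (none of these numbers come from the text). It shows that rescaling the distribution and dividing the parameters by the same factor leaves the probability unchanged.

```python
import math

def logistic_cdf(t):
    """F(t) for the logistic distribution."""
    return 1.0 / (1.0 + math.exp(-t))

def scaled_cdf(t, sigma):
    """G(t) = F(sigma * t), the cdf of the rescaled density g(t) = sigma * f(sigma * t)."""
    return logistic_cdf(sigma * t)

# Hypothetical values, chosen only for illustration.
beta = [0.5, -1.2]
x = [1.0, 0.8]          # first entry is the constant term
sigma = 2.0

index = sum(b * v for b, v in zip(beta, x))
scaled_index = sum((b / sigma) * v for b, v in zip(beta, x))

p_f = logistic_cdf(index)                 # model with F and beta
p_g = scaled_cdf(scaled_index, sigma)     # model with G and beta / sigma

print(p_f, p_g)  # the two probabilities coincide
```

Because the data only reveal the probabilities, the pairs ($F$, $\beta$) and ($G$, $\beta/\sigma$) cannot be told apart, which is why the scale is fixed in advance.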

Interpretation  of  model  in  terms  of  latent  variables 

The model (6.3) can be given an interpretation in terms of an unobserved
variable $y_i^*$ that represents the latent preference of individual $i$ for the choice
$y_i = 1$. It is assumed that

$$y_i^* = x_i'\beta + \varepsilon_i, \qquad \varepsilon_i \sim \mathrm{IID}, \quad E[\varepsilon_i] = 0.$$

This is the so-called index function, where $x_i'\beta$ is the systematic preference
and $\varepsilon_i$ the individual-specific effect. This takes the possibility into account
that individuals with the same observed characteristics $x$ may make different
choices because of unobserved individual effects. The observed choice $y$ is
related to the index $y^*$ by means of the equation

$$y_i = 1 \ \text{if}\ y_i^* > 0, \qquad y_i = 0 \ \text{if}\ y_i^* < 0.$$

It is assumed that the individual effects $\varepsilon_i$ are independent and identically
distributed with symmetric density $f$, that is, $f(\varepsilon_i) = f(-\varepsilon_i)$. It then follows
that $P[\varepsilon_i > -t] = \int_{-t}^{\infty} f(s)\,ds = \int_{-\infty}^{t} f(s)\,ds = P[\varepsilon_i < t]$, so that
$P[y_i = 1] = P[\varepsilon_i > -x_i'\beta] = P[\varepsilon_i < x_i'\beta] = F(x_i'\beta)$, where $F$ is the cumulative
distribution function of $\varepsilon_i$. This provides an interpretation of the model (6.3) in terms
of differences in the individual effects $\varepsilon_i$ over the population.
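
The latent variable interpretation can be illustrated by simulation. The sketch below assumes logistic individual effects and a hypothetical value of 0.7 for the systematic preference; it draws many individuals with the same index and compares the share choosing $y_i = 1$ with the value of the logistic distribution function at that index.

```python
import math
import random

def logistic_cdf(t):
    return 1.0 / (1.0 + math.exp(-t))

def simulate_choice_share(index, n, seed=0):
    """Share of n simulated individuals with y* = index + eps > 0 (logistic eps)."""
    rng = random.Random(seed)
    ones = 0
    for _ in range(n):
        u = rng.random()
        while u == 0.0:                    # avoid log(0) in the inverse cdf
            u = rng.random()
        eps = math.log(u / (1.0 - u))      # inverse of the logistic cdf
        if index + eps > 0:
            ones += 1
    return ones / n

# Hypothetical systematic preference, for illustration only.
share = simulate_choice_share(index=0.7, n=100_000)
print(share, logistic_cdf(0.7))  # simulated share versus F(index)
```

Individuals with identical observed characteristics still make different choices here, purely because of the unobserved effects.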

Interpretation  of  model  in  terms  of  utilities 

Another possible interpretation of the model (6.3) is in terms of the utilities
$U^0$ and $U^1$ of the two alternative choices. The utilities for individual $i$ are
defined by

$$U_i^0 = x_i'\beta_0 + \varepsilon_{0i}, \qquad U_i^1 = x_i'\beta_1 + \varepsilon_{1i}.$$

The alternative with maximal utility is chosen, so that

$$y_i = 1 \ \text{if}\ U_i^0 < U_i^1, \qquad y_i = 0 \ \text{if}\ U_i^0 > U_i^1.$$

In this case the choice depends on the difference in the utilities
$U_i^1 - U_i^0 = x_i'\beta + \varepsilon_i$, where $\beta = \beta_1 - \beta_0$ and $\varepsilon_i = \varepsilon_{1i} - \varepsilon_{0i}$. Again, if the
individual-specific terms $\varepsilon_i$ are assumed to be independent and identically distributed
with symmetric density $f$, it follows that $P[y_i = 1] = P[\varepsilon_i > -x_i'\beta]
= P[\varepsilon_i < x_i'\beta] = F(x_i'\beta)$. So this motivates the model (6.3) in terms of
unobserved individual effects in the utilities of the two alternatives.
Example 6.1: Direct Marketing for Financial Product

To  illustrate  the  modelling  of  binary  response  data,  we  consider  data  that 
were  collected  in  a  marketing  campaign  for  a  new  financial  product  of  a 
commercial  investment  firm  (Robeco).  We  will  discuss  (i)  the  motivation  of 
the  marketing  campaign,  and  (ii)  the  data  set. 

(i)  Motivation  of  the  marketing  campaign 

The  campaign  consisted  of  a  direct  mailing  to  customers  of  the  firm.  The 
firm  is  interested  in  identifying  characteristics  that  might  explain  which 
customers  are  interested  in  the  new  product  and  which  ones  are  not.  In 
particular,  there  may  be  differences  between  male  and  female  customers 
and  between  active  and  inactive  customers  (where  active  means  that  the 
customer  already  invests  in  other  products  of  the  firm).  Also  the  age 
of  customers  may  be  of  importance,  as  relatively  young  and  relatively  old 
customers  may  have  less  interest  in  investing  in  this  product  than  middle- 
aged  people. 
(ii) The data set

The variable to be explained is whether a customer is interested in the new
financial product or not. This is denoted by the binary variable $y_i$, with $y_i = 1$
if the $i$th customer is interested and $y_i = 0$ otherwise. Apart from a constant
term (denoted by $x_{1i} = 1$), the explanatory variables are gender (denoted by
$x_{2i} = 0$ for females and $x_{2i} = 1$ for males), activity (denoted by $x_{3i} = 1$ for
customers that are already active investors and $x_{3i} = 0$ for customers that do
not yet invest in other products of the firm), age (in years, denoted by $x_{4i}$) and
the square of age (divided by hundred, denoted by $x_{5i} = x_{4i}^2/100$).

The  data  set  considered  in  this  chapter  is  drawn  from  a  much  larger 
database  that  contains  more  than  100,000  observations.  A  sample  of  1000 
observations  is  drawn  from  this  database,  and  75  observations  are  omitted 
because  of  missing  data  (on  the  age  of  the  customer).  This  leaves  a  data  set  of 
$n = 925$ customers. Of these customers, 470 responded positively (denoted
by $y_i = 1$) and the remaining 455 did not respond (denoted by $y_i = 0$). The
original  data  set  of  more  than  100,000  observations  contains  only  around 
5000  respondents.  So  our  sample  contains  relatively  many  more  positive 
responses  (470  out  of  925)  than  the  original  database.  The  effect  of  this 
selection  is  analysed  in  Exercises  6.2  and  6.11.  For  further  background  on  the 
data  we  refer  to  the  research  report  by  P.  H.  Franses,  ‘On  the  Econometrics  of 
Modelling  Marketing  Response’,  RIBES  Report  97-15,  Rotterdam,  1997. 
This  data  set  will  be  further  analysed  in  Examples  6.2  and  6.3. 


6.1.2  Probit  and  logit  models 

Model  formulation 

The model (6.3) depends not only on the choice of the explanatory variables
$x$ but also on the shape of the distribution function $F$. This choice corresponds
to assuming a specific distribution for the unobserved individual
effects (in the index function or in the utilities) and it determines the shape
of the marginal response function (6.4) via the corresponding density function
$f$. In practice one often chooses either the standard normal density

$$f(t) = \phi(t) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}t^2}$$

or the logistic density

$$f(t) = \lambda(t) = \frac{e^t}{(1 + e^t)^2}.$$

The model (6.3) with the standard normal distribution is called
the probit model, and that with the logistic distribution is called the logit
model.

Comparison  of  probit  and  logit  model 

Both the standard normal density and the logistic density have mean zero and
are unimodal and symmetric. The standard deviation of both distributions is
fixed, for reasons explained before. The logistic distribution has standard
deviation $\sigma = \pi/\sqrt{3} \approx 1.8$, whereas the standard normal distribution has
standard deviation 1. In order to compare the two models, the graphs of the
density $\phi(t)$ and the standardized logistic density $\sigma\lambda(\sigma t)$ are given in Exhibit
6.2. This shows that, as compared to the probit model, the logit model has
marginal effects (6.4) that are relatively somewhat larger around the mean
and in the tails but somewhat smaller in the two regions in between. There are
often no compelling reasons to choose between the logit and probit model. An
advantage of the logit model is that the cumulative distribution function
$F = \Lambda$ can be computed explicitly, as

$$\Lambda(t) = \int_{-\infty}^{t} \lambda(s)\,ds = \frac{e^t}{1 + e^t} = \frac{1}{1 + e^{-t}}, \qquad (6.5)$$

whereas the cumulative distribution function $F = \Phi$ of the probit
model should be computed numerically by approximating the integral

$$\Phi(t) = \int_{-\infty}^{t} \phi(s)\,ds = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{t} e^{-\frac{1}{2}s^2}\,ds. \qquad (6.6)$$

In practice this poses no real problems, however, as there exist very accurate
numerical integration algorithms. In general the differences between the two
models are not so large, unless the tails of the distributions are of importance.
This is the case when the choices are very unbalanced, in the sense that the
fraction of individuals with $y_i = 1$ differs considerably from 1/2.

[Figure: the two densities plotted against x.]

Exhibit 6.2 Normal and logistic densities

Densities of the standard normal distribution (dashed line) and of the logistic distribution
(solid line, scaled so that both densities have standard deviation equal to 1). As compared with
the normal density, the logistic density has larger values around the mean (x = 0) and also in
both tails (for values of x far away from 0).

Comparison  of  parameters  of  the  two  models:  scaling 

One can, of course, always estimate both the logit and the probit model and
compare the outcomes. The parameters of the two models should be scaled
for such a comparison. Instead of the scaling factor 1.8, which gives the two
densities the same variance, one often uses another correction factor. The
marginal effects (6.4) of the explanatory variables are maximal around zero,
so that these effects are of special interest. As $\phi(0)/\lambda(0) = 4/\sqrt{2\pi} \approx 1.6$, the
estimated probit parameters $\beta$ can be multiplied by 1.6 to compare them with
the estimated logit parameters. In terms of Exhibit 6.2 this means that, after
scaling, the two densities have the same function value in $t = 0$.
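
The two scaling factors mentioned above follow directly from the formulas for the densities. A short sketch of the arithmetic (the variable names are illustrative only):

```python
import math

# Standard deviation of the logistic distribution: pi / sqrt(3).
sd_logistic = math.pi / math.sqrt(3.0)

# Density values at zero: phi(0) = 1 / sqrt(2 * pi), lambda(0) = e^0 / (1 + e^0)^2 = 1/4.
phi_at_zero = 1.0 / math.sqrt(2.0 * math.pi)
lambda_at_zero = 1.0 / (1.0 + 1.0) ** 2

# Ratio used to rescale probit parameters so that marginal effects match near zero.
ratio_at_zero = phi_at_zero / lambda_at_zero

print(round(sd_logistic, 2), round(ratio_at_zero, 2))
```

The first factor equates the variances of the two distributions, the second equates the heights of the densities at zero, which is where the marginal effects are largest.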

Marginal  effects  of  explanatory  variables 

As concerns the interpretation of the parameters $\beta$, (6.4) shows that the signs
of the coefficients $\beta_j$ and the relative magnitudes $\beta_j/\beta_h$ have a direct
interpretation in terms of the sign and the relative magnitude of the marginal
effects of the explanatory variables on the chance of success ($y_i = 1$). Since
the marginal effects depend on the values of $x_i$, these effects vary among the
different individuals. The effects of the $j$th explanatory variable can be
summarized by the mean marginal effects over the sample of $n$ individuals,
that is,

$$\frac{1}{n}\sum_{i=1}^{n} \frac{\partial P[y_i = 1]}{\partial x_{ji}} = \beta_j\, \frac{1}{n}\sum_{i=1}^{n} f(x_i'\beta), \qquad j = 2, \ldots, k.$$

Sometimes the effect at the mean values of the explanatory variables
is reported instead, that is, (6.4) evaluated at $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$. This is a bit
simpler to compute, but the interpretation is somewhat less clear.
When the $j$th explanatory variable is a dummy variable, it remains possible
to compute 'marginal' effects in this way. Instead, it is also possible to
compare the two situations $x_{ji} = 0$ and $x_{ji} = 1$ by comparing
$P[y_i = 1] = F\left(\sum_{l \neq j} \beta_l x_{li}\right)$ (for individuals with $x_{ji} = 0$) with
$P[y_i = 1] = F\left(\beta_j + \sum_{l \neq j} \beta_l x_{li}\right)$ (for individuals with $x_{ji} = 1$). This may reveal
differences in the effect of the dummy variable for different ranges of the other
explanatory variables ($x_{li}$ with $l \neq j$).
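
The difference between the mean marginal effect and the effect at the mean can be made concrete with a small sketch. It assumes a logit model and uses three made-up observations and made-up parameter values (not the data of the examples in this chapter).

```python
import math

def logistic_pdf(t):
    """lambda(t) = e^t / (1 + e^t)^2, written in a numerically stable symmetric form."""
    e = math.exp(-abs(t))
    return e / (1.0 + e) ** 2

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def mean_marginal_effect(X, beta, j):
    """beta_j times the sample average of f(x_i' beta)."""
    return beta[j] * sum(logistic_pdf(dot(x, beta)) for x in X) / len(X)

def marginal_effect_at_mean(X, beta, j):
    """beta_j times f evaluated at the sample mean of the regressors."""
    k = len(beta)
    x_bar = [sum(x[c] for x in X) / len(X) for c in range(k)]
    return beta[j] * logistic_pdf(dot(x_bar, beta))

# Hypothetical data: a constant and one regressor; hypothetical parameters.
X = [[1.0, 20.0], [1.0, 40.0], [1.0, 60.0]]
beta = [-2.0, 0.05]

mme = mean_marginal_effect(X, beta, j=1)
mem = marginal_effect_at_mean(X, beta, j=1)
print(mme, mem)
```

In this illustration the sample mean has index zero, where the density peaks, so the effect at the mean is larger than the mean of the individual effects.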

Comparison  of  probabilities  and  the  odds  ratio 

It may further be informative to consider the predicted probabilities
$p_i = P[y_i = 1] = F(x_i'\beta)$, $i = 1, \ldots, n$, for instance, the mean, variance,
minimum, and maximum of these probabilities. The individuals may also
be split into groups, after which the values of $p_i$ can be compared within and
between groups. Of special interest is the odds ratio, which is defined by

$$\frac{P[y_i = 1]}{P[y_i = 0]} = \frac{F(x_i'\beta)}{1 - F(x_i'\beta)}.$$

So the odds ratio is the relative preference of option 1 as compared to option
0. This preference depends on the values $x_i$ of the explanatory variables. The
log-odds is the natural logarithm of the odds ratio. In the logit model with
$F = \Lambda$ there holds $\Lambda(t) = e^t/(1 + e^t)$ and $1 - \Lambda(t) = 1/(1 + e^t)$, so that
$\Lambda(t)/(1 - \Lambda(t)) = e^t$ and

$$\log\left(\frac{\Lambda(x_i'\beta)}{1 - \Lambda(x_i'\beta)}\right) = x_i'\beta.$$

That is, in the logit model the log-odds is a linear function of the explanatory
variables.

As a constant term is included in the model, we can transform the data
by measuring all other explanatory variables $(x_2, \ldots, x_k)$ in deviation
from their sample mean. After this transformation, the odds ratio, evaluated
at the sample mean of the explanatory variables, becomes $F(\beta_1)/(1 - F(\beta_1))$,
and this provides the following interpretation of the constant term. If $\beta_1 = 0$,
then the odds ratio evaluated at the sample mean is equal to 1 (as $F(0) = 1/2$
both for the probit and for the logit model), so that for an 'average' individual
both choices are equally likely. If $\beta_1 > 0$, then $F(\beta_1) > F(0) = 1/2$, so that
an 'average' individual has a relative preference for alternative 1 above
alternative 0, and, if $\beta_1 < 0$, an 'average' individual has a relative preference for
alternative 0 above alternative 1.


Exercises: T: 6.2a-c; S: 6.7a-c.
6.1.3 Estimation and evaluation

The  likelihood  function 

The logit and probit models are non-linear and the parameters can be
estimated by maximum likelihood. Suppose that a random sample of $n$
outcomes of the binary variable $y_i$ is available. If the probability of success
is the same for all observations, say, $P[y_i = 1] = p$, then the probability
distribution of the $i$th observation is given by $p^{y_i}(1 - p)^{1 - y_i}$. If the
observations are mutually independent, then the likelihood function is given by
$L(p) = \prod_{i=1}^{n} p^{y_i}(1 - p)^{1 - y_i}$ and the log-likelihood by

$$\log(L(p)) = \sum_{\{i;\, y_i = 1\}} \log(p) + \sum_{\{i;\, y_i = 0\}} \log(1 - p)
= \sum_{i=1}^{n} y_i \log(p) + \sum_{i=1}^{n} (1 - y_i) \log(1 - p).$$

Maximizing this with respect to $p$ we get the ML estimator $\hat{p} = \sum_{i=1}^{n} y_i / n$.
Now suppose that the observations $y_1, \ldots, y_n$ are mutually independent but
that the probability of success differs among the observations according to
the model (6.3), all with the same function $F$ but with differences in the
values of the explanatory variables $x_i$. Then the variable $y_i$ follows a
Bernoulli distribution with probability

$$p_i = P[y_i = 1] = F(x_i'\beta)$$

on the outcome $y_i = 1$ and with probability $(1 - p_i)$ on the outcome $y_i = 0$.
The probability distribution is then given by $p(y_i) = p_i^{y_i}(1 - p_i)^{1 - y_i}$, $y_i = 0, 1$.
The log-likelihood is therefore equal to

$$\begin{aligned}
\log(L(\beta)) &= \sum_{i=1}^{n} y_i \log(p_i) + \sum_{i=1}^{n} (1 - y_i) \log(1 - p_i) \\
&= \sum_{i=1}^{n} y_i \log(F(x_i'\beta)) + \sum_{i=1}^{n} (1 - y_i) \log(1 - F(x_i'\beta)) \\
&= \sum_{\{i;\, y_i = 1\}} \log(F(x_i'\beta)) + \sum_{\{i;\, y_i = 0\}} \log(1 - F(x_i'\beta)). \qquad (6.7)
\end{aligned}$$

The terms $p_i$ depend on $\beta$, but for simplicity of notation we will in the sequel
often write $p_i$ instead of the more explicit expression $F(x_i'\beta)$.
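
The log-likelihood (6.7) translates directly into code. The sketch below assumes a logit specification and uses a tiny made-up sample; the function and variable names are illustrative, not part of any particular software package.

```python
import math

def logistic_cdf(t):
    return 1.0 / (1.0 + math.exp(-t))

def log_likelihood(y, X, beta, cdf=logistic_cdf):
    """Sum of y_i * log(p_i) + (1 - y_i) * log(1 - p_i) with p_i = F(x_i' beta)."""
    total = 0.0
    for yi, xi in zip(y, X):
        p = cdf(sum(b * v for b, v in zip(beta, xi)))
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total

# Hypothetical sample with a constant term only.
y = [1, 0, 1, 1]
X = [[1.0], [1.0], [1.0], [1.0]]

# With beta = 0 every observation has p_i = 1/2.
ll_at_zero = log_likelihood(y, X, [0.0])

# With a constant only, the ML estimate sets F(beta_1) equal to the sample fraction.
y_bar = sum(y) / len(y)
beta_hat = [math.log(y_bar / (1.0 - y_bar))]
ll_at_hat = log_likelihood(y, X, beta_hat)

print(ll_at_zero, ll_at_hat)
```

In the constant-only case the maximizing value reproduces the estimator $\hat{p} = \sum y_i / n$ derived above, and the log-likelihood there is higher than at any other parameter value.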
Maximization of the log-likelihood

The maximum likelihood estimates are obtained by solving the first
order conditions. Using the fact that the density function $f(t)$ is the derivative of
the cumulative distribution function $F(t)$, the $k$ first order conditions are given by

$$\begin{aligned}
g(\beta) = \frac{\partial \log(L)}{\partial \beta}
&= \sum_{i=1}^{n} \frac{y_i}{p_i} \frac{\partial p_i}{\partial \beta} + \sum_{i=1}^{n} \frac{1 - y_i}{1 - p_i} \frac{\partial (1 - p_i)}{\partial \beta} \\
&= \sum_{i=1}^{n} \frac{y_i}{p_i} f_i x_i - \sum_{i=1}^{n} \frac{1 - y_i}{1 - p_i} f_i x_i \\
&= \sum_{i=1}^{n} \frac{y_i - p_i}{p_i(1 - p_i)} f_i x_i = 0. \qquad (6.8)
\end{aligned}$$

Here $f_i = f(x_i'\beta)$ is the density function corresponding to the cumulative
distribution function $F$. These first order conditions can be seen as a variation of
the normal equations $\sum_{i=1}^{n} e_i x_i = 0$ of the linear regression model. In a binary
response model, $y_i - p_i = y_i - P[y_i = 1]$ is the residual of the model (6.3) with
respect to the actually observed outcome of $y_i$. The weighting factor $p_i(1 - p_i)$ is
equal to the variance of $y_i$, so that this corresponds to the usual correction for
heteroskedasticity in weighted least squares (see Section 5.4.3 (p. 327-8)). Finally,
the factor $f_i$ reflects the fact that the marginal effects (6.4) are not constant over the
sample (as is the case in a linear regression model) but depend on the value of
$f(x_i'\beta)$. The set of $k$ non-linear equations $g(\beta) = 0$ can be solved numerically, for
instance, by Newton-Raphson, to give the estimate $b$. To get an idea of the
effects of the different explanatory variables it can be helpful to plot the predicted
probabilities $P[y = 1] = F(x'b)$ and the corresponding odds ratio or log-odds
against each individual explanatory variable, fixing the other variables at their
sample means.


Approximate  distribution  of  the  ML  estimator 

The general properties of ML estimators were discussed in Section 4.3.3
(p. 228); for instance, large sample standard errors can be obtained from the
inverse of the information matrix. It is often convenient to use the outer product of
gradients expression for this (see Sections 4.3.2 and 4.3.3 and formula (4.57)
in Section 4.3.8). With the notation introduced there, we have
$\partial l_i/\partial \beta = (y_i - p_i) f_i x_i / (p_i(1 - p_i))$, so that the covariance matrix of $b$ can be estimated by

$$\mathrm{var}(b) \approx V = \left( \sum_{i=1}^{n} \frac{\partial l_i}{\partial \beta} \frac{\partial l_i}{\partial \beta'} \right)^{-1}
= \left( \sum_{i=1}^{n} \frac{(y_i - \hat{p}_i)^2}{\hat{p}_i^2 (1 - \hat{p}_i)^2}\, \hat{f}_i^2\, x_i x_i' \right)^{-1}, \qquad (6.9)$$

where $\hat{p}_i = F(x_i'b)$ and $\hat{f}_i = f(x_i'b)$. Under the stated assumptions, that is, that the
observations $y_i$ are independently distributed with $P[y_i = 1] = F(x_i'\beta)$ with the
same cumulative distribution function $F$ for all observations, the ML estimator
$b$ has an asymptotic normal distribution in the sense that $\sqrt{n}(b - \beta)$ converges in
distribution to the normal distribution with mean zero and covariance matrix
$\mathrm{plim}(nV)$. This probability limit exists under weak regularity conditions on the
explanatory variables $x_i$. In finite samples this gives

$$b \approx N(\beta, V). \qquad (6.10)$$

These results can be used to perform $t$- and $F$-tests in the usual way, and of course
the Likelihood Ratio test (4.44) can also be applied.


Results  for  the  logit  model 

The foregoing expressions apply for any choice of the distribution function $F$. As an
illustration we consider the logit model with $F = \Lambda$ in (6.5) in more detail. The
expression for the gradient (6.8) simplifies in this case as

$$f_i = \lambda(x_i'\beta) = \frac{e^{x_i'\beta}}{(1 + e^{x_i'\beta})^2}
= \frac{e^{x_i'\beta}}{1 + e^{x_i'\beta}} \left( 1 - \frac{e^{x_i'\beta}}{1 + e^{x_i'\beta}} \right) = \Lambda_i (1 - \Lambda_i),$$

so that $f_i = p_i(1 - p_i)$ in this case. Therefore the logit estimates are obtained by
solving the $k$ equations

$$g(\beta) = \sum_{i=1}^{n} (y_i - p_i) x_i = \sum_{i=1}^{n} \left( y_i - \frac{1}{1 + e^{-x_i'\beta}} \right) x_i = 0.$$

As the first explanatory variable is the constant term with $x_{1i} = 1$ for all
$i = 1, \ldots, n$, it follows that $\sum_{i=1}^{n} (y_i - \hat{p}_i) = 0$, so that

$$\frac{1}{n} \sum_{i=1}^{n} \hat{p}_i = \frac{1}{n} \sum_{i=1}^{n} y_i.$$

So the logit model has the property that the average predicted probabilities of
success and failure are equal to the observed fractions of successes and failures in
the sample. The ML first order conditions (6.11) have a unique solution, because
the Hessian matrix

$$\frac{\partial^2 \log(L)}{\partial \beta\, \partial \beta'} = \frac{\partial g(\beta)}{\partial \beta'}
= -\sum_{i=1}^{n} f_i x_i x_i' = -\sum_{i=1}^{n} p_i(1 - p_i) x_i x_i'$$

is negative definite. This simplifies the numerical optimization, and in general the
Newton-Raphson iterations will converge rather rapidly to the global maximum.
The information matrix (for given values of the explanatory variables) is given by

$$I_n = -E\left[ \frac{\partial^2 \log(L)}{\partial \beta\, \partial \beta'} \right] = \sum_{i=1}^{n} p_i(1 - p_i) x_i x_i'. \qquad (6.11)$$

Large sample standard errors of the logit parameters can be obtained, as discussed
in Section 4.3.3 (p. 228), that is, by substituting the logit estimate $b$ for $\beta$ in the
above expression and by taking the square roots of the diagonal elements of
the inverse of (6.11). Expression (6.9) for the covariance matrix can be obtained
from (6.11) by replacing the terms $p_i(1 - p_i)$ in (6.11) by $(y_i - p_i)^2$, since for
the logit model $f_i^2 = p_i^2(1 - p_i)^2$, so that these terms cancel in (6.9). As
$E[(y_i - p_i)^2] = \mathrm{var}(y_i) = p_i(1 - p_i)$, the two expressions (6.9) and $I_n^{-1}$ of (6.11)
for the covariance matrix are asymptotically equivalent.
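
The logit results above can be sketched as a small Newton-Raphson routine. This is an illustration under simplifying assumptions, not the algorithm of any specific package: it handles only a constant and one regressor (so the 2 x 2 information matrix can be inverted by hand), uses a fixed number of iterations, and is run on made-up data.

```python
import math

def logistic_cdf(t):
    return 1.0 / (1.0 + math.exp(-t))

def score_and_information(y, x, b0, b1):
    """Gradient sum (y_i - p_i) x_i and information sum p_i (1 - p_i) x_i x_i'."""
    g0 = g1 = h00 = h01 = h11 = 0.0
    for yi, xi in zip(y, x):
        p = logistic_cdf(b0 + b1 * xi)
        w = p * (1.0 - p)
        g0 += yi - p
        g1 += (yi - p) * xi
        h00 += w
        h01 += w * xi
        h11 += w * xi * xi
    return (g0, g1), (h00, h01, h11)

def fit_logit(y, x, iterations=25):
    """Newton-Raphson for a logit model with a constant and a single regressor."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        (g0, g1), (h00, h01, h11) = score_and_information(y, x, b0, b1)
        det = h00 * h11 - h01 * h01
        # Newton step b + I^{-1} g, since the Hessian equals minus the information matrix.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    # Large sample standard errors: square roots of the diagonal of I^{-1}.
    _, (h00, h01, h11) = score_and_information(y, x, b0, b1)
    det = h00 * h11 - h01 * h01
    return b0, b1, math.sqrt(h11 / det), math.sqrt(h00 / det)

# Hypothetical data, for illustration only.
y = [0, 0, 1, 0, 1, 0, 1, 1]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

b0, b1, se0, se1 = fit_logit(y, x)
p_hat = [logistic_cdf(b0 + b1 * xi) for xi in x]
print(b0, b1, se0, se1)
print(sum(p_hat) / len(p_hat), sum(y) / len(y))  # equal averages, as derived above
```

The last line illustrates the property that, with a constant term, the average predicted probability equals the observed fraction of successes.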



XM601DMF 


Remarks  on  the  probit  model 

The  analysis  of  the  probit  model  is  technically  somewhat  more  involved.  The 
Hessian  matrix  is  again  negative  definite,  and  the  numerical  optimization 
poses  no  problems  in  general.  With  suitable  software,  the  practical  usefulness 
of  probit  and  logit  models  is  very  much  alike. 

Example  6.2:  Direct  Marketing  for  Financial  Product  (continued) 

We  continue  our  analysis  of  the  direct  mailing  data  introduced  in  Example 
6.1.  We  will  discuss  (i)  the  outcomes  of  estimated  logit  and  probit  models  for 
the  probability  that  a  customer  is  interested  in  the  product,  and  (ii)  the  odds 
ratios  (depending  on  the  age  of  the  customer)  of  the  two  models. 

(i)  Outcomes  of  logit  and  probit  models 

The dependent variable is $y_i$ with $y_i = 1$ if the $i$th individual is interested and
$y_i = 0$ otherwise. The explanatory variables are gender, activity, and age
(with  a  linear  and  a  squared  term)  (see  Example  6.1).  The  results  of  logit 
and  probit  models  are  given  in  Panels  2  and  3  of  Exhibit  6.3.  For  comparison 
the  results  of  the  linear  probability  model  are  also  given  (see  Panel  1).  All 
models  indicate  that  the  variables  ‘gender’  and  ‘activity’  are  statistically  the 
most  significant  ones.  As  the  corresponding  two  parameters  are  positive, 
these  variables  have  a  positive  impact  on  the  probability  of  responding  to 
the  mailing.  That  is,  male  customers  and  active  customers  tend  to  be  more 
interested  than  female  and  inactive  customers.  The  effects  of  ‘gender’  and 
‘activity’  are  almost  the  same. 

The numerical values of the coefficients of the three models can be
compared by determining the mean marginal effects of the explanatory variables
in the three models. As discussed in Section 6.1.2, the mean marginal effect of
the $j$th explanatory variable is $\beta_j \frac{1}{n}\sum_{i=1}^{n} f(x_i'\beta)$, so we take as correction factor
$\frac{1}{n}\sum_{i=1}^{n} f(x_i'b)$. For our data, in the logit model this correction factor is 0.230
and in the probit model it is 0.373. For instance, the mean marginal effect of
the variable gender is 0.224 in the linear probability model (see Panel 1),
$0.954 \cdot 0.230 = 0.219$ in the logit model, and $0.588 \cdot 0.373 = 0.219$ in the
probit model. So the coefficients of the variable gender differ in the three
models (0.224, 0.954, 0.588), but their interpretation in terms of mean
marginal effects is very much the same. This also holds true for the
coefficients of the other explanatory variables.
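
The comparison can be reproduced from the numbers reported in Exhibit 6.3. In the sketch below the coefficients and scale factors are copied from the three panels; the scale factor of 1 for the linear probability model is an assumption that reflects the fact that its coefficient is itself the marginal effect.

```python
# Coefficients of GENDER and scale factors as reported in Exhibit 6.3.
gender_coefficient = {"linear": 0.224002, "logit": 0.953694, "probit": 0.588114}
scale_factor = {"linear": 1.0, "logit": 0.229533, "probit": 0.372705}

# Mean marginal effect = coefficient times the average density value.
mean_marginal_effect = {
    model: gender_coefficient[model] * scale_factor[model]
    for model in gender_coefficient
}

for model, effect in mean_marginal_effect.items():
    print(model, round(effect, 3))
```

The same multiplication can be applied to the other coefficients in Panels 2 and 3 to compare them with Panel 1.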
The variable ‘age’ has an effect that first increases and then decreases,
although  the  effects  are  only  marginally  significant  (at  5  per  cent  significance 
level).  However,  the  possible  effect  of  age  is  of  great  practical  importance  for 
the  firm. 


Exhibit 6.3 Direct Marketing for Financial Product (Example 6.2)

Responses to direct mailing (1 = response, 0 = no response) explained by gender, activity
dummy, and age (quadratic function). Estimates obtained from the linear probability model
(Panel 1), the logit model (Panel 2), and the probit model (Panel 3). The reported scale factors
are the averages of $f(x_i'b)$ over the sample, with $f$ the logistic density (Panel 2) or the standard
normal density (Panel 3).

Panel 1: Dependent Variable: RESPONSE
Method: Least Squares
Sample: 1 1000
Included observations: 925; Excluded observations: 75

Variable      Coefficient   Std. Error   t-Statistic   Prob.
C             -0.060888     0.195906     -0.310802     0.7560
GENDER         0.224002     0.035809      6.255535     0.0000
ACTIVITY       0.208268     0.040669      5.121010     0.0000
AGE            0.015494     0.007861      1.971057     0.0490
AGE^2/100     -0.015209     0.007507     -2.026048     0.0430

R-squared             0.081542
S.E. of regression    0.480418

Panel 2: Dependent Variable: RESPONSE
Method: ML - Binary Logit
Sample: 1 1000
Included observations: 925; Excluded observations: 75
Convergence achieved after 5 iterations

Variable      Coefficient   Std. Error   z-Statistic   Prob.
C             -2.488358     0.889992     -2.795932     0.0052
GENDER         0.953694     0.158183      6.029070     0.0000
ACTIVITY       0.913748     0.184779      4.945090     0.0000
AGE            0.069945     0.035605      1.964455     0.0495
AGE^2/100     -0.068692     0.034096     -2.014643     0.0439

S.E. of regression          0.480195
Log likelihood              -601.8624
Scale factor (marg. eff.)   0.229533

Panel 3: Dependent Variable: RESPONSE
Method: ML - Binary Probit
Sample: 1 1000
Included observations: 925; Excluded observations: 75
Convergence achieved after 5 iterations

Variable      Coefficient   Std. Error   z-Statistic   Prob.
C             -1.497584     0.536822     -2.789720     0.0053
GENDER         0.588114     0.096684      6.082811     0.0000
ACTIVITY       0.561167     0.111572      5.029656     0.0000
AGE            0.041680     0.021544      1.934636     0.0530
AGE^2/100     -0.040982     0.020607     -1.988730     0.0467

S.E. of regression          0.480242
Log likelihood              -601.9497
Scale factor (marg. eff.)   0.372705

[Figure: panels (d) and (e), odds ratios plotted against age.]

Estimated odds ratios for logit model (d) and for probit model (e) against age. In both
diagrams, the top curve is for active males, the second one for non-active males, the (nearly
coinciding) third one for active females, and the lowest one for non-active females.


(ii)  Odds  ratios  depending  on  age 

To  give  an  impression  of  the  age  effect,  Exhibit  6.3  shows  the  estimated  odds 
ratios (for the logit model in (d) and for the probit model in (e)) against the
variable  ‘age’.  All  odds  ratios  are  highest  around  an  age  of  50  years.  In  each 
diagram,  the  top  curve  shows  that  males  who  are  already  active  investors 
have  a  probability  of  responding  to  the  direct  mailing  that  is  two  to  three 
times  as  large  as  the  probability  of  not  responding.  The  opposite  odds  ratios 
apply  for  females  who  are  not  yet  investing.  As  the  coefficients  of  ‘gender’ 
and  ‘activity’  are  almost  equal,  the  odds  ratios  for  inactive  males  and  active 
females  coincide  approximately. 

Exercises: T: 6.2d, e; S: 6.7d-f, 6.8a, b; E: 6.11, 6.13a, b.


6.1.4  Diagnostics 

In this section we discuss some diagnostic tools for logit and probit
models, namely, the goodness of fit (LR-test and $R^2$), the predictive quality
(classification table and hit rate), and analysis of the residuals (in particular
an LM-test for heteroskedasticity).




Goodness  of  fit 

The significance of individual explanatory variables can be tested by the
usual $t$-test based on (6.10). The sample size should be sufficiently large to
rely on the asymptotic expressions for the standard errors, and the $t$-test
statistic then follows approximately the standard normal distribution. Joint
parameter restrictions can be tested by the Likelihood Ratio test. For logit
and probit models it is no problem to estimate the unrestricted and restricted
models, at least if the restrictions are not too involved. The overall goodness
of fit of the model can be tested by the LR-test on the null hypothesis that all
coefficients (except the constant term) are zero, that is, $\beta_2 = \cdots = \beta_k = 0$.
This test follows (asymptotically) the $\chi^2(k - 1)$ distribution. Sometimes one
reports measures similar to the $R^2$ of linear regression models, for instance,
McFadden's $R^2$ defined by

$$R^2 = 1 - \frac{\log(L_1)}{\log(L_0)},$$

where $L_1$ is the maximum value of the unrestricted likelihood function and
$L_0$ that of the restricted likelihood function. It follows from (6.7) that
$\log(L_0) \le \log(L_1) < 0$, so that $0 \le R^2 < 1$ and higher values of $R^2$ correspond to a
relatively higher overall significance of the model. Note, however, that this
$R^2$ cannot be used, for example, to choose between a logit and a probit
model, as these two models have different likelihood functions.
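
As an illustration, the LR-test and McFadden's $R^2$ can be computed for the logit model of Example 6.2. The sketch below takes the unrestricted log-likelihood from Panel 2 of Exhibit 6.3 and the counts of 470 responses and 455 non-responses from Example 6.1. It assumes that the restricted model contains only a constant, in which case its maximized log-likelihood follows from the ML estimator $\hat{p} = \sum y_i / n$.

```python
import math

# Counts from Example 6.1 and the logit log-likelihood from Exhibit 6.3 (Panel 2).
n_success, n_failure = 470, 455
log_l1 = -601.8624

# Restricted model (constant only): every observation gets p_hat = n_success / n.
n = n_success + n_failure
p_hat = n_success / n
log_l0 = n_success * math.log(p_hat) + n_failure * math.log(1.0 - p_hat)

# LR statistic for beta_2 = ... = beta_k = 0, asymptotically chi-squared with k - 1 = 4 df.
lr_statistic = 2.0 * (log_l1 - log_l0)

# McFadden's R-squared.
mcfadden_r2 = 1.0 - log_l1 / log_l0

print(round(log_l0, 2), round(lr_statistic, 2), round(mcfadden_r2, 3))
```

The LR statistic is far above the 5 per cent critical value of the $\chi^2(4)$ distribution (about 9.49), while McFadden's $R^2$ is small, which is common for binary response data with much individual variation.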

Predictive  quality 

Alternative specifications of the model may be compared by evaluating
whether the model gives a good classification of the data into the two categories
$y_i = 1$ and $y_i = 0$. The estimated model gives predicted probabilities $\hat{p}_i$ for
the choice $y_i = 1$, and this can be transformed into predicted choices by
predicting that $\hat{y}_i = 1$ if $\hat{p}_i > c$ and $\hat{y}_i = 0$ if $\hat{p}_i \le c$. The choice of $c$ can
sometimes be based on the costs of misclassification. In practice one often
takes $c = \frac{1}{2}$ or, if the fraction $\hat{p}$ of successes differs much from 50 per cent, one
sometimes takes $c = \hat{p}$. This leads to a $2 \times 2$ classification table of the predicted
responses $\hat{y}_i$ against the actually observed responses $y_i$. The hit rate is
defined as the fraction of correct predictions in the sample. Formally, let $w_i$ be
the random variable indicating a correct prediction, that is, $w_i = 1$ if
$y_i = \hat{y}_i$ and $w_i = 0$ if $y_i \ne \hat{y}_i$; then the hit rate is defined by $h = \frac{1}{n}\sum_{i=1}^{n} w_i$.
In the population the fraction of successes is $p$. If we randomly make the
prediction 1 with probability $p$ and 0 with probability $(1 - p)$, then we make
a correct prediction with probability $q = p^2 + (1 - p)^2$. Using the properties
of the binomial distribution for the number of correct random predictions, it
follows that the 'random' hit rate $h_r$ has expected value $E[h_r] = E[w] = q$
and variance $\mathrm{var}(h_r) = \mathrm{var}(w)/n = q(1 - q)/n$. The predictive quality of our
model can be evaluated by comparing our hit rate $h$ with the random hit rate
$h_r$. Under the null hypothesis that the predictions of the model are no better
than pure random predictions, the hit rate $h$ is approximately normally
distributed with mean $q$ and variance $q(1 - q)/n$. Therefore we reject the
null hypothesis of random predictions in favour of the (one-sided) alternative
of better-than-random predictions if

$$z = \frac{h - q}{\sqrt{q(1 - q)/n}} = \frac{nh - nq}{\sqrt{nq(1 - q)}}$$

is large enough (larger than 1.645 at 5 per cent significance level). In practice,
$q = p^2 + (1 - p)^2$ is unknown and estimated by $\hat{p}^2 + (1 - \hat{p})^2$, where $\hat{p}$ is the
fraction of successes in the sample. In the above expression for the z-test, $nh$
is the total number of correct predictions in the sample and $nq$ is the expected
number of correct random predictions.
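
The hit-rate test is easily computed by hand. The following Python sketch (the function name and argument names are ours) applies it to the logit classification counts of Exhibit 6.4 in Example 6.3: 570 correct predictions among 925 observations, of which 470 are successes.

```python
import math

def hit_rate_test(n_correct, n_success, n):
    """z-test of the model hit rate against the 'random' hit rate."""
    p_hat = n_success / n                    # sample fraction of successes
    q_hat = p_hat**2 + (1.0 - p_hat)**2      # estimated random hit rate
    h = n_correct / n                        # hit rate of the model
    z = (n * h - n * q_hat) / math.sqrt(n * q_hat * (1.0 - q_hat))
    return h, q_hat, z

h, q_hat, z = hit_rate_test(n_correct=570, n_success=470, n=925)
reject_random = z > 1.645  # one-sided test at 5 per cent significance
print(round(h, 3), round(q_hat, 4), reject_random)
```

The resulting z-value is close to the 7.07 shown in the exhibit; the small difference arises because the exhibit uses 462.5 for $nq$.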

Description  may  be  more  relevant  than  prediction 

Although  the  comparison  of  the  classification  success  of  alternative  models 
may  be  of  interest,  it  should  be  realized  that  the  parameters  of  binary  response 
models  are  chosen  to  maximize  the  likelihood  function,  and  not  directly  to 
maximize a measure of fit between the observed outcomes $y_i$ and the predicted
outcomes $\hat{y}_i$. This is another distinction with the linear regression model,
where maximizing the (normal) likelihood function is equivalent to maximizing
the (least squares) fit. A binary response model may be preferred over
another  one  because  it  gives  a  more  useful  description,  for  example,  of  the 
marginal  effects  (6.4),  even  if  it  performs  worse  in  terms  of  classification. 

Standardized  residuals  and  consequences  of  heteroskedasticity 

The residuals $e_i$ of a binary response model are defined as the differences
between the observed outcomes $y_i$ and the fitted probabilities $\hat{p}_i$. As the
variance of $y_i$ (for given values of $x_i$) is $p_i(1 - p_i)$, the standardized residuals
are defined by

$$e_i^* = \frac{y_i - \hat{p}_i}{\sqrt{\hat{p}_i(1 - \hat{p}_i)}}. \qquad (6.12)$$
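
A minimal Python sketch of (6.12) follows; the outcomes and fitted probabilities are made-up numbers chosen so that the results are easy to check by hand.

```python
import math

def standardized_residuals(y, p_hat):
    """Standardized residuals (6.12) of a binary response model."""
    return [(yi - pi) / math.sqrt(pi * (1.0 - pi)) for yi, pi in zip(y, p_hat)]

y = [1, 0, 1, 0]                 # hypothetical observed outcomes
p_hat = [0.8, 0.2, 0.2, 0.5]     # hypothetical fitted probabilities
e_star = standardized_residuals(y, p_hat)
print([round(e, 6) for e in e_star])
```

The third observation is a success that the model considered unlikely ($\hat{p}_i = 0.2$), and it gets the largest standardized residual. In general, large positive values signal unexpected successes and large negative values unexpected failures.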


A histogram of the standardized residuals may be of interest, for example,
to detect outliers. Further, scatter diagrams of these residuals against
explanatory variables are useful to investigate the possible presence of
heteroskedasticity. Heteroskedasticity can be due to different kinds of
misspecification of the model. It may be, for instance, that a relevant explanatory
variable is missing or that the function $F$ is misspecified. In contrast with the
linear regression model, where OLS remains consistent under heteroskedasticity,
maximum likelihood estimators of binary response models become
inconsistent under this kind of misspecification. For instance, if the data
generating process is a probit model but one estimates a logit model,
then the estimated parameters and marginal effects are inconsistent and
the calculated standard errors are not correct. However, as the differences
between the probit function $\Phi$ and the logit function $\Lambda$ are not so large,
the outcomes may still be reasonably reliable. If one has doubts on the
correct choice of the distribution function $F$, it may be helpful to compute
the standard errors in two ways, that is, by the ML expression (6.9) and
also by GMM based on the 'moment' conditions (6.8). If the two sets
of computed standard errors differ significantly, then this is a sign of
misspecification.


Likelihood  Ratio  test  on  heteroskedasticity 

A formal test for heteroskedasticity can be based on the index model
$y_i^* = x_i'\beta + \varepsilon_i$. Until now it was assumed that the error terms $\varepsilon_i$ all follow
the same distribution (described by $F$). As an alternative we consider the
model where all $\varepsilon_i/\sigma_i$ follow the same distribution $F$ where

$$\sigma_i = e^{z_i'\gamma},$$

with $z_i$ a vector of observed variables. The constant term should not be
included in this vector because (as was discussed in Section 6.1.1) the scale
parameter of a binary response model should be fixed, independent of the
data. We assume again that the density function $f$ (the derivative of $F$) is
symmetric, that is, $f(t) = f(-t)$. It then follows that

$$P[y_i = 1] = P[y_i^* > 0] = P[\varepsilon_i > -x_i'\beta] = P[\varepsilon_i/\sigma_i > -x_i'\beta/\sigma_i] = P[\varepsilon_i/\sigma_i < x_i'\beta/\sigma_i] = F(x_i'\beta/\sigma_i),$$

so that

$$P[y_i = 1] = F\left(x_i'\beta / e^{z_i'\gamma}\right). \qquad (6.13)$$

The null hypothesis of homoskedasticity corresponds to the parameter restriction
$H_0: \gamma = 0$. This hypothesis can be tested by the LR-test. The unrestricted
likelihood function is obtained from the log-likelihood (6.7) by
replacing the terms $p_i = F(x_i'\beta)$ by $p_i = F(x_i'\beta / e^{z_i'\gamma})$.
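
To make the role of $\gamma$ concrete, the sketch below evaluates the probability (6.13) and the corresponding log-likelihood in Python for the logit case $F = \Lambda$. All names, data and parameter values are hypothetical. The maximization over $(\beta, \gamma)$ that the LR-test requires is not shown; given the two maximized values, the statistic would be $2(\log L_1 - \log L_0)$.

```python
import math

def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))

def het_probability(x, beta, z, gamma):
    """P[y = 1] = F(x'beta / exp(z'gamma)) as in (6.13), with F the logistic cdf."""
    index = sum(b * xi for b, xi in zip(beta, x))
    log_sigma = sum(g * zi for g, zi in zip(gamma, z))
    return logistic(index / math.exp(log_sigma))

def loglik(data, beta, gamma):
    """Log-likelihood (6.7) with p_i replaced by the heteroskedastic probability."""
    total = 0.0
    for y, x, z in data:
        p = het_probability(x, beta, z, gamma)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

# hypothetical observations: (y, x including the constant, z without constant)
data = [(1, [1.0, 0.5], [2.0]), (0, [1.0, -1.0], [0.5]), (1, [1.0, 1.5], [1.0])]
beta = [0.2, 0.8]                                             # illustrative values
p_homo = het_probability([1.0, 0.5], beta, [2.0], [0.0])      # gamma = 0
p_het = het_probability([1.0, 0.5], beta, [2.0], [0.3])       # sigma_i > 1
print(round(p_homo, 4), round(p_het, 4), round(loglik(data, beta, [0.0]), 4))
```

With $\gamma = 0$ the probability reduces to the homoskedastic value $\Lambda(x_i'\beta)$. A larger $\sigma_i$ shrinks the index towards zero and hence pulls the probability towards one half.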

Lagrange Multiplier test on heteroskedasticity

An alternative is to use the LM-test, so that only the model under the null
hypothesis (with $\gamma = 0$) needs to be estimated. By working out the formulas for
the gradient and the Hessian of the unrestricted likelihood, it can be shown that
the LM-test can be performed as if (6.13) were a non-linear regression model. The
correctness of the following steps to compute the LM-test is left as an exercise (see
Exercise 6.1).

First estimate the model without heteroskedasticity, that is, under the
null hypothesis that $\gamma = 0$. This amounts to estimating the model
$P[y_i = 1] = F(x_i'\beta)$ by ML, as discussed in Section 6.1.3. The residuals of this
model are denoted by $e_i = y_i - \hat{p}_i = y_i - F(x_i'b)$. As a second step, regress the
residuals $e_i$ on the gradient of the non-linear model $P[y_i = 1] = F(x_i'\beta / e^{z_i'\gamma})$, taking
into account that the residuals are heteroskedastic. This amounts to applying
(feasible) weighted least squares, that is, OLS after dividing the $i$th observation
by its (estimated) standard deviation. The variance of the 'error term' $y_i - p_i$
is $\mathrm{var}(y_i - p_i) = \mathrm{var}(y_i) = p_i(1 - p_i)$. We replace $p_i$ by $\hat{p}_i$ obtained in the first step,
so that the weight of the $i$th observation in WLS is given by $1/\sqrt{\hat{p}_i(1 - \hat{p}_i)}$. Further,
the gradient of the function $F(x_i'\beta / e^{z_i'\gamma})$ in the model (6.13), when evaluated at
$\gamma = 0$, is given by

$$\frac{\partial F(x_i'\beta / e^{z_i'\gamma})}{\partial \beta} = f(x_i'\beta)\,x_i, \qquad \frac{\partial F(x_i'\beta / e^{z_i'\gamma})}{\partial \gamma} = -f(x_i'\beta)\,x_i'\beta\,z_i.$$

Therefore, the required auxiliary regression in this second step can be written in
terms of the standardized residuals (6.12) as

$$\frac{y_i - \hat{p}_i}{\sqrt{\hat{p}_i(1 - \hat{p}_i)}} = \frac{f(x_i'b)}{\sqrt{\hat{p}_i(1 - \hat{p}_i)}}\,x_i'\delta_1 + \frac{f(x_i'b)\,x_i'b}{\sqrt{\hat{p}_i(1 - \hat{p}_i)}}\,z_i'\delta_2 + \eta_i, \qquad (6.14)$$

where the minus sign of the gradient with respect to $\gamma$ is absorbed in the coefficient vector $\delta_2$.

Under the null hypothesis of homoskedasticity, there holds that $LM = nR_{nc}^2$ of this
regression, where $R_{nc}^2$ denotes the non-centred $R^2$, that is, the explained sum of
squares of (6.14) is divided by the non-centred total sum of squares $\sum_{i=1}^{n} (e_i^*)^2$. As
the regression in (6.14) does not contain a constant term on the right-hand side,
one should take here the non-centred $R^2$ defined by $R_{nc}^2 = \sum (\hat{e}_i^*)^2 / \sum (e_i^*)^2$, where
$\hat{e}_i^*$ denote the fitted values of the regression in (6.14). We reject the null hypothesis
for large values of the LM-test, and under the null hypothesis of homoskedasticity
($\gamma = 0$) it is asymptotically distributed as $\chi^2(g)$, where $g$ is the number of variables
in $z_i$, that is, the number of parameters in $\gamma$.


This  can  be  summarized  as  follows. 

Computation of LM-test on heteroskedasticity

•  Step 1: Estimate the restricted model. Estimate the homoskedastic model
$P[y_i = 1] = F(x_i'\beta)$ by ML. Let $\hat{p}_i = F(x_i'b)$ and define the generalized
residuals $e_i^*$ by (6.12).

•  Step 2: Auxiliary regression of generalized residuals of step 1. Regress the
generalized residuals $e_i^*$ of step 1 on the (scaled) gradient of the heteroskedastic
model $P[y_i = 1] = F(x_i'\beta / e^{z_i'\gamma})$, that is, perform OLS in (6.14).

•  Step 3: $LM = nR_{nc}^2$ of step 2. Then $LM = nR_{nc}^2$, where $R_{nc}^2$ is the
non-centred $R^2$ of the regression in step 2. If the null hypothesis of
homoskedasticity ($\gamma = 0$) holds true, then $LM \approx \chi^2(g)$, where $g$ is the number of
parameters in $\gamma$.
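
The three steps can be sketched in plain Python for the logit model, where $f = \Lambda(1 - \Lambda)$. This is an illustrative implementation under our own assumptions, not the book's code: the helper names are ours, the data are simulated from a homoskedastic logit model with made-up parameters, and a small Gaussian elimination routine replaces a linear algebra library.

```python
import math
import random

def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def fit_logit(xs, ys, iterations=25):
    """Step 1: ML estimation of the homoskedastic logit model (Newton-Raphson)."""
    k = len(xs[0])
    b = [0.0] * k
    for _ in range(iterations):
        p = [logistic(sum(bj * xj for bj, xj in zip(b, x))) for x in xs]
        grad = [sum((y - pi) * x[a] for x, y, pi in zip(xs, ys, p)) for a in range(k)]
        info = [[sum(pi * (1 - pi) * x[a] * x[c] for x, pi in zip(xs, p))
                 for c in range(k)] for a in range(k)]
        b = [bj + sj for bj, sj in zip(b, solve(info, grad))]
    return b

def lm_heteroskedasticity(xs, zs, ys):
    b = fit_logit(xs, ys)
    e_star, rows = [], []
    for x, z, y in zip(xs, zs, ys):
        index = sum(bj * xj for bj, xj in zip(b, x))
        p = logistic(index)
        f = p * (1.0 - p)                    # logistic density at the index
        sd = math.sqrt(p * (1.0 - p))
        e_star.append((y - p) / sd)          # generalized residual (6.12)
        rows.append([f * xj / sd for xj in x] + [f * index * zj / sd for zj in z])
    # Step 2: OLS (without constant) of e_star on the scaled gradient, as in (6.14)
    k = len(rows[0])
    xtx = [[sum(r[a] * r[c] for r in rows) for c in range(k)] for a in range(k)]
    xty = [sum(r[a] * e for r, e in zip(rows, e_star)) for a in range(k)]
    delta = solve(xtx, xty)
    fitted = [sum(d * v for d, v in zip(delta, r)) for r in rows]
    # Step 3: LM = n times the non-centred R^2
    r2_nc = sum(v * v for v in fitted) / sum(e * e for e in e_star)
    return len(ys) * r2_nc

rng = random.Random(0)                       # simulated, homoskedastic data
xs, zs, ys = [], [], []
for _ in range(300):
    x1, z1 = rng.uniform(-2.0, 2.0), rng.uniform(0.0, 1.0)
    xs.append([1.0, x1])
    zs.append([z1])
    ys.append(1 if rng.random() < logistic(0.5 + x1) else 0)
lm = lm_heteroskedasticity(xs, zs, ys)
print(round(lm, 3))
```

The value of `lm` would be compared with the critical value of the $\chi^2(1)$ distribution, as `zs` contains a single variable. For a probit model the density `f` and the function `logistic` would have to be replaced by their normal counterparts, and the Newton step in `fit_logit` would change accordingly.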


Example  6.3:  Direct  Marketing  for  Financial  Product  (continued) 

We  perform  some  diagnostic  checks  on  the  logit  and  probit  models  that  were 
estimated  for  the  direct  mailing  data  in  Example  6.2.  We  will  discuss  (i)  the 
significance  of  the  explanatory  variables,  (ii)  the  investigation  of  the  possible 
presence  of  outliers  and  heteroskedasticity,  and  (iii)  the  predictive  perform¬ 
ance  of  the  models.  Exhibit  6.4  reports  the  results  of  these  diagnostic  checks. 

(i)  Significance  of  the  explanatory  variables 

In  Example  6.2  we  concluded  that  the  variables  ‘gender’  and  ‘activity’  are 
significant  but  that  the  linear  and  quadratic  age  variables  are  individually 
only  marginally  significant.  Panel  1  of  Exhibit  6.4  contains  the  result  of  the 
LR-test for the joint significance of the two age variables. This indicates that
they are jointly not significant, as P = 0.12 in the logit model and P = 0.13 in
the probit model. The two models have nearly equal and not so large values
of $R^2$ (0.061), but the LR-test for the joint significance of the variables
$(x_2, \ldots, x_5)$ in Panel 1 of Exhibit 6.4 shows that the models have explanatory
power.  The  combination  of  statistical  significance  with  relatively  low  fit  is 
typical  for  models  explaining  individual  behaviour.  This  means  that  the 
model  may  have  difficulty  in  describing  individual  decisions  but  that  it 
gives  insight  into  the  overall  pattern  of  behaviour. 

(ii)  Investigation  of  possible  outliers  and  heteroskedasticity 

The  maximum  and  minimum  values  of  the  standardized  residuals  reported  in 
Panel  1  of  Exhibit  6.4  for  the  logit  and  probit  model  do  not  indicate  the 
presence  of  outliers.  To  test  for  the  possible  presence  of  heteroskedasticity, 
we consider the model $\sigma_i = e^{\gamma z_i}$, where $z_i$ is the total amount of money that
individual $i$ has already invested in other products of the bank. The test
outcomes provide some evidence for the presence of heteroskedasticity
(P = 0.01).

(iii)  Predictive  performance 

Panels 2 and 3 of Exhibit 6.4 also contain results on the predictive performance
of the logit and probit models. The hit rates are 0.616 for the logit
model and 0.622 for the probit model. This is well above the expected hit
rate of around 0.5 of random predictions (more precisely, of the 925 observations
there are 470 with $y_i = 1$ and 455 with $y_i = 0$ so that $\hat{p} = \frac{470}{925}$ and
$\hat{p}^2 + (1 - \hat{p})^2 = 0.5001$). The test whether the predictions are better than
random gives values of z = 7.07 for the logit model and z = 7.40 for the
probit model, with P-value P = 0.00 (see Panels 2 and 3 of Exhibit 6.4). This
shows that the classification of respondents by the logit and probit models is
better than what would have been achieved by random predictions. The
models are more successful in predicting positive responses (around 80 per
cent is predicted correctly) than in predicting no response (of which a bit
more than 40 per cent is predicted correctly).

Panel 1: DIAGNOSTIC TEST RESULTS

| | LOGIT | PROBIT |
|---|---|---|
| Standardized residuals: maximum | 2.123 | 2.135 |
| Standardized residuals: minimum | -1.786 | -1.787 |
| Heteroskedasticity LM test value (df = 1) | 6.237 | 6.186 |
| corresponding P-value | 0.0125 | 0.0129 |
| LR test for significance of explanatory variables (df = 4) | 78.35 | 78.18 |
| corresponding P-value | 0.0000 | 0.0000 |
| LR test for significance of age variables (df = 2) | 4.247 | 4.089 |
| corresponding P-value | 0.1196 | 0.1294 |
| R-squared | 0.061 | 0.061 |

Panel 2: LOGIT: Prediction Evaluation (success cutoff C = 0.5)

| | Estimated Equation: Dep = 0 | Estimated Equation: Dep = 1 | Estimated Equation: Total | Constant Probability: Dep = 0 | Constant Probability: Dep = 1 | Constant Probability: Total |
|---|---|---|---|---|---|---|
| P(Dep = 1) <= C | 196 | 96 | 292 | 0 | 0 | 0 |
| P(Dep = 1) > C | 259 | 374 | 633 | 455 | 470 | 925 |
| Total | 455 | 470 | 925 | 455 | 470 | 925 |
| Correct | 196 | 374 | 570 | 0 | 470 | 470 |
| % Correct | 43.08 | 79.57 | 61.62 | 0.00 | 100.00 | 50.81 |
| % Incorrect | 56.92 | 20.43 | 38.38 | 100.00 | 0.00 | 49.19 |

p = 470/925 = 0.508, random hit rate p^2 + (1 - p)^2 = 0.5001
Z-value = (570 - 462.5)/sqrt(925*0.5001*0.4999) = 7.07, P = 0.0000

Panel 3: PROBIT: Prediction Evaluation (success cutoff C = 0.5)

| | Estimated Equation: Dep = 0 | Estimated Equation: Dep = 1 | Estimated Equation: Total | Constant Probability: Dep = 0 | Constant Probability: Dep = 1 | Constant Probability: Total |
|---|---|---|---|---|---|---|
| P(Dep = 1) <= C | 190 | 85 | 275 | 0 | 0 | 0 |
| P(Dep = 1) > C | 265 | 385 | 650 | 455 | 470 | 925 |
| Total | 455 | 470 | 925 | 455 | 470 | 925 |
| Correct | 190 | 385 | 575 | 0 | 470 | 470 |
| % Correct | 41.76 | 81.91 | 62.16 | 0.00 | 100.00 | 50.81 |
| % Incorrect | 58.24 | 18.09 | 37.84 | 100.00 | 0.00 | 49.19 |

p = 470/925 = 0.508, random hit rate p^2 + (1 - p)^2 = 0.5001
Z-value = (575 - 462.5)/sqrt(925*0.5001*0.4999) = 7.40, P = 0.0000

Exhibit 6.4 Direct Marketing for Financial Product (Example 6.3)

Outcomes of various diagnostic tests for logit and probit models for responses to direct mailing
(Panel 1) and predictive performance of logit model (Panel 2) and of probit model (Panel 3).
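
The P-values in Panel 1 of Exhibit 6.4 can be checked from the reported test values. The sketch below uses closed-form expressions for the upper tail of the $\chi^2$ distribution that hold for one degree of freedom and for an even number of degrees of freedom; the function name is ours, and a statistical library would normally be used instead.

```python
import math

def chi2_sf(x, df):
    """Upper tail probability P[chi2(df) > x] for df = 1 or even df."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df % 2 == 0:
        half = x / 2.0
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= half / j          # (x/2)^j / j!
            total += term
        return math.exp(-half) * total
    raise ValueError("this sketch handles only df = 1 or even df")

p_lm_logit = chi2_sf(6.237, 1)     # heteroskedasticity LM test, logit
p_lm_probit = chi2_sf(6.186, 1)    # heteroskedasticity LM test, probit
p_age_logit = chi2_sf(4.247, 2)    # LR test on the two age variables, logit
p_age_probit = chi2_sf(4.089, 2)   # LR test on the two age variables, probit
p_all_logit = chi2_sf(78.35, 4)    # LR test on all explanatory variables, logit
print(round(p_lm_logit, 4), round(p_age_logit, 4))
```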

Exercises: T: 6.1; S: 6.8c-f; E: 6.10, 6.13c, d.


6.1.5  Model  for  grouped  data 

Grouped  data 

Sometimes, for instance, for reasons of confidentiality, the individual
data are not given and only the average values of the variables over groups
of individuals are reported. For instance, the investment decisions of customers
of a bank may be averaged over residential areas (zip codes) or over age
groups. Suppose that the individual data satisfy the binary response model
(6.3), that is, $P[y_i = 1] = F(x_i'\beta)$ with the same function $F$ for all
$i = 1, \ldots, n$. Let the data be grouped into $G$ groups, with $n_j$ individuals in
group $j$. The groups should be chosen so that the values of the explanatory
variables $x$ are reasonably constant within each group. Let $\bar{x}_j$ denote the
vector of group means of the explanatory variables for the $n_j$ individuals in
this group. Let $\bar{y}_j$ be the fraction of individuals in group $j$ that have chosen
alternative 1, so that a fraction $1 - \bar{y}_j$ has chosen the alternative 0. The data
consist of the $G$ values of $(\bar{y}_j, \bar{x}_j)$, and the group sizes $n_j$ are assumed to be
known, $j = 1, \ldots, G$.


Estimation  by  maximum  likelihood 

It is assumed that $\bar{x}_j$ is a close enough approximation of the characteristics of
all individuals in group $j$ so that their probabilities to choose alternative
1 are constant and given by $p_j = F(\bar{x}_j'\beta)$. Then the joint contribution of the
individuals in group $j$ to the log-likelihood (6.7) is given by $n_{j1} \log(p_j) +
(n_j - n_{j1}) \log(1 - p_j)$, where $n_{j1} = n_j \bar{y}_j$ is the number of individuals in group $j$
that choose alternative 1. So, in terms of the observed fractions $\bar{y}_j$, the
log-likelihood becomes

$$\log(L) = \sum_{j=1}^{G} n_j \left( \bar{y}_j \log(p_j) + (1 - \bar{y}_j) \log(1 - p_j) \right). \qquad (6.15)$$

It is required that $k \le G$, that is, the number of explanatory variables may
not be larger than the number of groups. The model parameters can be
estimated by maximum likelihood, much in the same way as was discussed
in Section 6.1.3 for the case of individual binary response data. The model
imposes restrictions if the decisions of $G$ groups are modelled in terms of
$k < G$ parameters. To test the specification of the model, one can consider as
an alternative the model that contains a dummy for each group, that is, with
$p_j = F(\delta_j)$ for $j = 1, \ldots, G$. This model contains $G$ parameters and allows for
arbitrary different specific probabilities for each group. The corresponding
maximum likelihood estimates are given by $\hat{p}_j = F(\hat{\delta}_j) = \bar{y}_j$. The model $p_j = F(\bar{x}_j'\beta)$
imposes $(G - k)$ parameter restrictions $\delta_j = \bar{x}_j'\beta$. This can be tested by the
LR-test that follows a $\chi^2(G - k)$ distribution under the null hypothesis of
correct specification.
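
The log-likelihood (6.15) and the comparison with the model with group dummies can be sketched as follows in Python. The grouped data and the parameter values are hypothetical, and the parameter values are not ML estimates, so the printed quantity is only an upper bound for the LR statistic that would result after maximizing (6.15) over $\beta$.

```python
import math

def xlogy(a, b):
    """a * log(b) with the convention 0 * log(0) = 0."""
    return 0.0 if a == 0.0 else a * math.log(b)

def grouped_loglik(sizes, fractions, probs):
    """Log-likelihood (6.15) for grouped binary response data."""
    return sum(n * (xlogy(y, p) + xlogy(1.0 - y, 1.0 - p))
               for n, y, p in zip(sizes, fractions, probs))

def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))

# hypothetical grouped data: G = 4 groups, a constant and one regressor
sizes = [50, 80, 60, 40]
x_bar = [20.0, 30.0, 40.0, 50.0]
y_bar = [0.30, 0.45, 0.50, 0.70]
beta = [-1.7, 0.045]                       # illustrative values, not ML estimates

probs = [logistic(beta[0] + beta[1] * x) for x in x_bar]
loglik_model = grouped_loglik(sizes, y_bar, probs)
loglik_dummies = grouped_loglik(sizes, y_bar, y_bar)   # fitted p_j equal to y_bar_j
gap = 2.0 * (loglik_dummies - loglik_model)
print(round(loglik_model, 3), round(loglik_dummies, 3), round(gap, 3))
```

The model with a dummy for each group always attains the largest value of (6.15), so that `gap` cannot be negative.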


Estimation  by  feasible  weighted  least  squares 

Instead of using the above maximum likelihood approach, one can also use
feasible weighted least squares (FWLS) to estimate the parameters $\beta$. This is
based on the fact that $\bar{y}_j$ is the sample mean of $n_j$ independent drawings from
the Bernoulli distribution with mean $p_j$ and variance $p_j(1 - p_j)$. If $n_j$ is sufficiently
large, it follows from the central limit theorem that

$$\bar{y}_j \approx N\left(p_j, \frac{p_j(1 - p_j)}{n_j}\right).$$

If $F$ is continuous and monotonically increasing (as is the case for logit and probit
models), then the inverse function $F^{-1}$ exists. We define transformed observations

$$z_j = F^{-1}(\bar{y}_j).$$

Using the facts that $F^{-1}(p_j) = \bar{x}_j'\beta$ and that $F^{-1}(p)$ has derivative $1/f(F^{-1}(p))$, it
follows that in large enough samples

$$z_j \approx N\left(\bar{x}_j'\beta, \frac{p_j(1 - p_j)}{n_j f_j^2}\right),$$

where $f_j = f(\bar{x}_j'\beta)$. This can be written as a regression equation

$$z_j = \bar{x}_j'\beta + \varepsilon_j, \qquad j = 1, \ldots, G.$$

Here the error terms $\varepsilon_j$ are independent and approximately normally distributed
with mean zero and variances $\sigma_j^2 = p_j(1 - p_j)/(n_j f_j^2)$. So the error terms are
heteroskedastic. Then $\beta$ can be estimated by FWLS, for instance, as follows. In
the first step $\beta$ is estimated by OLS, regressing $z_j$ on $\bar{x}_j$ for the $G$ groups. Let $b$ be
the OLS estimate; then the variance $\sigma_j^2$ of $\varepsilon_j$ can be estimated by replacing $p_j$ by
$\hat{p}_j = F(\bar{x}_j'b)$ and $f_j$ by $f(\bar{x}_j'b)$, so that

$$s_j^2 = \hat{p}_j(1 - \hat{p}_j)/(n_j f^2(\bar{x}_j'b)).$$

In the second step $\beta$ is estimated by WLS, using the estimated standard deviations
of $\varepsilon_j$ to obtain the appropriate weighting factors. That is, in the second step OLS is
applied in the transformed model

$$\frac{z_j}{s_j} = \left(\frac{\bar{x}_j}{s_j}\right)'\beta + \omega_j, \qquad j = 1, \ldots, G.$$

FWLS  in  the  logit  model 

We specify the above general method in more detail for the logit model. In
this case the required regressions simplify somewhat, because the logit model
has the property that $f_j = \lambda_j = \Lambda_j(1 - \Lambda_j) = p_j(1 - p_j)$ (see Section 6.1.3). So
$s_j^2 = 1/(n_j \hat{p}_j(1 - \hat{p}_j))$ and the FWLS estimates are obtained by performing OLS
in the following regression model.

$$\sqrt{n_j \hat{p}_j(1 - \hat{p}_j)}\; z_j = \sqrt{n_j \hat{p}_j(1 - \hat{p}_j)}\; \bar{x}_j'\beta + \omega_j, \qquad \hat{p}_j = \Lambda(\bar{x}_j'b) = \frac{1}{1 + e^{-\bar{x}_j'b}}.$$

So, for the logit model the FWLS estimates are obtained by regressing $z_j$ on $\bar{x}_j$
(with OLS estimate $b$) followed by a regression of $w_j z_j$ on $w_j \bar{x}_j$ with weights

$$w_j = \sqrt{n_j \hat{p}_j(1 - \hat{p}_j)} = \sqrt{n_j}\, \frac{e^{\bar{x}_j'b/2}}{1 + e^{\bar{x}_j'b}}.$$

FWLS is asymptotically equivalent to ML (see also Section 5.4.4 (p. 336)).
However, if $n_j$ is relatively small for some groups, it may be preferable to use
ML. An example using grouped data is left as an exercise: see Exercise 6.12, which
considers the direct mailing data averaged over ten age groups.
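
The two FWLS steps for the logit model can be sketched in Python for the case of a constant and a single explanatory variable, so that the normal equations can be solved by hand. For the logit model the transformation is $z_j = \Lambda^{-1}(\bar{y}_j) = \log(\bar{y}_j/(1 - \bar{y}_j))$. The function names and the data are hypothetical. The group fractions are generated without noise from parameters $(-1, 0.5)$, so that both steps should recover these values up to rounding errors.

```python
import math

def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))

def weighted_line_fit(x, z, w):
    """OLS of w*z on (w, w*x); returns (intercept, slope)."""
    s = sum(wi * wi for wi in w)
    sx = sum(wi * wi * xi for wi, xi in zip(w, x))
    sxx = sum(wi * wi * xi * xi for wi, xi in zip(w, x))
    sz = sum(wi * wi * zi for wi, zi in zip(w, z))
    sxz = sum(wi * wi * xi * zi for wi, xi, zi in zip(w, x, z))
    slope = (s * sxz - sx * sz) / (s * sxx - sx * sx)
    intercept = (sz - slope * sx) / s
    return intercept, slope

def fwls_logit(sizes, x_bar, y_bar):
    z = [math.log(y / (1.0 - y)) for y in y_bar]         # z_j = inverse logit of y_bar_j
    # first step: OLS of z_j on a constant and x_bar_j
    b0, b1 = weighted_line_fit(x_bar, z, [1.0] * len(z))
    # second step: OLS after multiplication by w_j = sqrt(n_j p_j (1 - p_j))
    p_hat = [logistic(b0 + b1 * x) for x in x_bar]
    w = [math.sqrt(n * p * (1.0 - p)) for n, p in zip(sizes, p_hat)]
    return weighted_line_fit(x_bar, z, w)

sizes = [40, 60, 80, 50, 30]                             # hypothetical group sizes
x_bar = [-2.0, -1.0, 0.0, 1.0, 2.0]                      # hypothetical group means
y_bar = [logistic(-1.0 + 0.5 * x) for x in x_bar]        # noise-free fractions
beta_fwls = fwls_logit(sizes, x_bar, y_bar)
print(beta_fwls)
```

With actual data the fractions $\bar{y}_j$ contain sampling noise, and groups with $\bar{y}_j$ equal to 0 or 1 cannot be transformed; this is one reason why ML may be preferable when some groups are small.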


Exercises: E: 6.12.


6.1.6 Summary

To  model  the  underlying  factors  that  influence  the  outcome  of  a  binary 
dependent  variable  we  take  the  following  steps. 

•  Determine the possibly relevant explanatory variables and formulate a
model of the form $P[y = 1] = F(x'\beta)$, where $y$ is the dependent variable
(with possible outcomes 0 and 1) and $x$ is the vector of explanatory
variables. The function $F$ is chosen as a cumulative distribution function,
in most cases $F = \Lambda$ of (6.5) (the logit model) or $F = \Phi$ of (6.6) (the
probit model).

•  Estimate the parameters $\beta$ of the model by maximum likelihood. For
logit and probit models, the required non-linear optimization can be
solved without any problems by standard numerical methods.

•  The estimated model can be interpreted in terms of the signs and
significance of the estimated coefficients $\beta$ and in terms of the mean
marginal effects and odds ratios discussed in Section 6.1.2.

•  The  model  can  be  evaluated  in  different  ways,  by  diagnostic  tests 
(standardized  residuals,  test  on  heteroskedasticity)  and  by  measuring 
the  model  quality  (goodness  of  fit  and  predictive  performance). 

•  The  approach  for  grouped  (instead  of  individual)  data  is  similar;  the 
main  distinction  is  that  the  log-likelihood  is  now  given  by  (6.15)  instead 
of  (6.7). 

6.2 Multinomial data


6.2.1  Unordered  response 

Uses  Chapters  1-4;  Section  5.4;  Section  6.1. 


Multinomial  data 

When  the  dependent  variable  has  a  finite  number  of  possible  outcomes,  the 
data  are  called  multinomial.  This  occurs,  for  instance,  when  individuals  can 
choose  among  more  than  two  options.  In  some  cases  the  options  can  be 
ordered  (for  example,  how  much  one  agrees  or  disagrees  with  a  statement), 
in  other  cases  the  different  options  are  unordered  (for  example  the  choice 
of  travel  mode  for  urban  commuters).  In  this  section  and  the  next  one 
we  discuss  models  for  unordered  data  and  in  Section  6.2.3  we  consider 
ordered  data. 

Multinomial  model  for  individual-specific  data 

Let $m$ be the number of alternatives. These alternatives (for example, to
travel by bicycle, bus, car or train) are supposed to have no natural ordering.
However, for ease of reference the alternatives are labeled by an index
$j = 1, \ldots, m$, so that the response $y_i = j$ is a nominal (not an ordinal) variable.
Let $n_j$ be the number of observations with response $y_i = j$ and let
$n = \sum_{j=1}^{m} n_j$ be the total number of observations. Suppose that, apart from
the choices $y_i$, also the values $x_i$ of $k$ explanatory variables are observed,
$i = 1, \ldots, n$. The first element of $x_i$ is the constant term $x_{1i} = 1$, and the other
elements of $x_i$ represent characteristics of the $i$th individual. A possible model
in terms of stochastic utilities is given by

$$U_i^j = u_{ij} + \varepsilon_{ij} = x_i'\beta_j + \varepsilon_{ij}. \qquad (6.16)$$

Here $x_i$ is a $k \times 1$ vector of explanatory variables for individual $i$ and $\beta_j$ is
a $k \times 1$ vector of parameters for alternative $j$. Further, $u_{ij} = x_i'\beta_j$ represents
the systematic utility of alternative $j$ for an individual with characteristics $x_i$,
and $\beta_j$ measures the relative weights of the characteristics in the derived
utility. The differences between the alternatives are modelled by differences
in the weights, and $\beta_{jl} - \beta_{hl}$ measures the marginal increase of the utility of
alternative $j$ as compared to alternative $h$ when the $l$th explanatory variable
rises by one unit. The terms $\varepsilon_{ij}$ are individual-specific and represent unmodelled
factors in individual preferences.

The model (6.16) is called the multinomial model. This model can be used
if data are available on the individual-specific values of the $k$ explanatory
variables $x_i$, $i = 1, \ldots, n$, and there is no data information on the characteristics
of the alternatives $j = 1, \ldots, m$. The differences between the alternatives
are modelled by the unknown $k \times 1$ parameter vectors $\beta_j$, $j = 1, \ldots, m$.

Conditional  model  for  individual-  and  alternative-specific  data 

Another type of model is obtained when aspects of the alternatives are
measured for each individual, for example, the travel times for alternative
transport modes. Let $x_{ij}$ be the vector of values of the explanatory variables
that apply for individual $i$ and alternative $j$. A possible model for the
utilities is

$$U_i^j = u_{ij} + \varepsilon_{ij} = x_{ij}'\beta + \varepsilon_{ij}, \qquad (6.17)$$

where $x_{ij}$ and $\beta$ are $m \times 1$ vectors. This is called the conditional model. This
model can be used if relevant characteristics $x_{ij}$ of the $m$ alternatives can be
measured for the $n$ individuals. The difference with the multinomial model
(6.16) is that the differences between the alternatives $j$ and $h$ are measured
now by $(x_{ij} - x_{ih})$, which may vary between individuals, whereas in (6.16)
these differences are $(\beta_j - \beta_h)$, which are unknown and the same for all
individuals.

Choice  model  and  log-likelihood 

Both in the multinomial model and in the conditional model, it is assumed
that the $i$th individual chooses the alternative $j$ for which the utility $U_i^j$ is
maximal. It then follows that

$$p_{ij} = P[y_i = j] = P[u_{ij} + \varepsilon_{ij} > u_{ih} + \varepsilon_{ih} \text{ for all } h \ne j], \qquad (6.18)$$

where $u_{ij} = x_i'\beta_j$ or $u_{ij} = x_{ij}'\beta$ depending on which of the two above models is
chosen.

In order to estimate the parameters, the joint distribution of the terms $\varepsilon_{ij}$
has to be specified. It is assumed that (conditional on the given values of
the explanatory variables) the individuals make independent choices, so that
$\varepsilon_{ij}$ and $\varepsilon_{gh}$ are independent for all $i \ne g$ and all $j, h = 1, \ldots, m$. The
log-likelihood can then be written as follows, where $y_{ij} = 1$ if $y_i = j$ and $y_{ij} = 0$
otherwise, and where $p_{iy_i} = p_{ij}$ for the actually chosen alternative $j = y_i$.

$$\log(L) = \sum_{i=1}^{n} \sum_{j=1}^{m} y_{ij} \log(p_{ij}) = \sum_{i=1}^{n} \log(p_{iy_i}). \qquad (6.19)$$

So this consists of the sum over the $n$ terms $\log(p_{ij})$, where $j = y_i$ is the
alternative chosen by individual $i$. For the binary choice model with $m = 2$
alternatives, (6.19) reduces to the log-likelihood (6.7).
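
The structure of (6.19) and its reduction to the binary case can be checked with a few lines of Python. The function names and the probabilities are hypothetical; alternatives are labelled $1, \ldots, m$ as in the text, and in the binary case alternative 2 plays the role of the outcome $y_i = 1$.

```python
import math

def multinomial_loglik(choices, probs):
    """Log-likelihood (6.19): sum over i of log p_{i, y_i}.

    choices[i] is the chosen alternative (1, ..., m) of individual i and
    probs[i] is the list (p_i1, ..., p_im) of choice probabilities.
    """
    return sum(math.log(p_i[y_i - 1]) for y_i, p_i in zip(choices, probs))

def binary_loglik(y, p):
    """Binary log-likelihood: sum of y log(p) + (1 - y) log(1 - p)."""
    return sum(yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
               for yi, pi in zip(y, p))

p_success = [0.7, 0.4, 0.9]                  # hypothetical values of P[y_i = 1]
y_binary = [1, 0, 1]
choices = [2, 1, 2]                          # the same outcomes, labelled 1 and 2
probs = [[1.0 - p, p] for p in p_success]
print(multinomial_loglik(choices, probs), binary_loglik(y_binary, p_success))
```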

Multinomial  and  conditional  probit  models 

The ML estimates of the parameters of the model can be obtained by
maximizing (6.19) after the joint distribution of the terms $\varepsilon_{ij}$, $j = 1, \ldots, m$,
in (6.18) has been specified. For example, suppose that these terms are jointly
normally distributed with mean zero and (unknown) $m \times m$ covariance
matrix $V$, so that

$$\varepsilon_i = \begin{pmatrix} \varepsilon_{i1} \\ \vdots \\ \varepsilon_{im} \end{pmatrix} \sim \mathrm{NID}(0, V).$$

If $u_{ij} = x_i'\beta_j$, as in the multinomial model (6.16), then the model (6.18) for the
choice probabilities $p_{ij}$ with $\varepsilon_i \sim \mathrm{NID}(0, V)$ is called the multinomial probit
model. And if $u_{ij} = x_{ij}'\beta$, as in the conditional model (6.17), then (6.18) with
$\varepsilon_i \sim \mathrm{NID}(0, V)$ is called the conditional probit model. An important advantage
of incorporating the covariance matrix $V$ in the model is the following.
When two alternatives $j$ and $h$ are perceived as being close together, then a
typical preference $\varepsilon_{ij} = U_i^j - u_{ij} > 0$ (meaning that the $i$th individual derives a
larger utility from alternative $j$ than is usual for individuals with the same
values of the explanatory variables) will mostly correspond to a preference
$\varepsilon_{ih} = U_i^h - u_{ih} > 0$ as well. That is, if in the multinomial probit model $\beta_j \approx \beta_h$
or in the conditional probit model $x_{ij} \approx x_{ih}$ (so that the utilities derived from
the alternatives $j$ and $h$ are close together), then it may be expected that $\varepsilon_{ij}$
and $\varepsilon_{ih}$ are positively correlated. Such correlations can be modelled by the
off-diagonal elements $(j, h)$ of the covariance matrix $V$.


Estimation  of  multinomial  and  conditional  probit  models 

The multinomial and conditional probit models can be estimated by ML. For fixed
values of the parameters, the log-likelihood (6.19) can be evaluated by numerical
integration of the probabilities $p_{ij}$ in (6.18). As the probability $p_{ij}$ involves the
$(m - 1)$ conditions $\varepsilon_{ij} - \varepsilon_{ih} > u_{ih} - u_{ij}$ (for $h \ne j$), this probability is expressed as
an $(m - 1)$ dimensional integral in terms of the $(m - 1)$ random variables
$(\varepsilon_{ij} - \varepsilon_{ih})$. The evaluation of this integral (for given values of $(u_{ih} - u_{ij})$, that is,
of $x_i'(\beta_h - \beta_j)$ in the multinomial model and of $(x_{ih} - x_{ij})'\beta$ in the conditional
model) requires appropriate numerical integration techniques. Numerically simpler
likelihood functions can be obtained by choosing other distributions for the
error terms $\varepsilon_{ij}$, as is discussed in the next section.
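
To get a feeling for the probabilities (6.18) in the probit case, one can approximate them by simulation. The Python sketch below is a crude frequency simulator and not one of the numerical integration techniques referred to above. The function names, the utilities and the covariance matrices are hypothetical. For given systematic utilities $u_{ij}$ and covariance matrix $V$ it draws error terms, records which alternative has maximal utility, and reports the relative frequencies.

```python
import math
import random

def cholesky(v):
    """Lower triangular matrix l with l l' = v (v symmetric positive definite)."""
    n = len(v)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(l[i][k] * l[j][k] for k in range(j))
            if i == j:
                l[i][j] = math.sqrt(v[i][i] - s)
            else:
                l[i][j] = (v[i][j] - s) / l[j][j]
    return l

def simulate_choice_probs(u, v, draws=20000, seed=12345):
    """Relative frequency with which each alternative has maximal utility."""
    rng = random.Random(seed)
    m = len(u)
    chol = cholesky(v)
    counts = [0] * m
    for _ in range(draws):
        std = [rng.gauss(0.0, 1.0) for _ in range(m)]
        eps = [sum(chol[j][k] * std[k] for k in range(j + 1)) for j in range(m)]
        utilities = [u[j] + eps[j] for j in range(m)]
        counts[utilities.index(max(utilities))] += 1
    return [c / draws for c in counts]

u = [0.0, 0.0, 0.0]                                   # equal systematic utilities
independent = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
correlated = [[1.0, 0.8, 0.0], [0.8, 1.0, 0.0], [0.0, 0.0, 1.0]]
p_independent = simulate_choice_probs(u, independent)
p_correlated = simulate_choice_probs(u, correlated)
print(p_independent, p_correlated)
```

With independent error terms the three alternatives are chosen about equally often. If the error terms of the first two alternatives are strongly positively correlated, these alternatives behave as close substitutes and the third alternative is chosen more often than one-third of the time, which illustrates the role of the off-diagonal elements of $V$.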


Parameter  restrictions  needed  for  identification 

Some parameter restrictions have to be imposed, as the probabilities $p_{ij}$
depend only on the differences $u_{ih} - u_{ij}$ and as they are invariant
under multiplication of the utilities $U_i^j$ by a positive constant. The last problem can
be solved by fixing one of the variances, for instance, by setting $E[\varepsilon_{i1}^2] =
V_{11} = 1$. Further, for the multinomial probit model $u_{ih} - u_{ij} = x_i'(\beta_h - \beta_j)$, so
that one of the parameter vectors can be chosen arbitrarily, for instance,
$\beta_1 = 0$, which corresponds to choosing the first alternative as reference. In
the conditional probit model $u_{ih} - u_{ij} = (x_{ih} - x_{ij})'\beta$, so that the vector of
explanatory variables should not include a constant term in this case.


6.2.2  Multinomial  and  conditional  logit 

Model  formulation 

Although multinomial and conditional probit models can be estimated
by suitable numerical integration methods, simpler models are often preferred
in practice. A considerable simplification is obtained by assuming
that all the $mn$ error terms $\varepsilon_{ij}$ are independently and identically distributed
(for all individuals and all alternatives) with the so-called extreme value
distribution. It can be shown (see Exercise 6.3) that in this case the
multinomial and the conditional probabilities in (6.18) become


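multinomial logit:

$$p_{ij} = \frac{e^{x_i'\beta_j}}{\sum_{h=1}^{m} e^{x_i'\beta_h}} = \frac{e^{x_i'\beta_j}}{1 + \sum_{h=2}^{m} e^{x_i'\beta_h}},$$

conditional logit:

$$p_{ij} = \frac{e^{x_{ij}'\beta}}{\sum_{h=1}^{m} e^{x_{ih}'\beta}}. \qquad (6.20)$$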


For the multinomial model we used the identification convention to choose
$\beta_1 = 0$ for the first (reference) category. The first model for the choice
probabilities $p_{ij}$ is called the multinomial logit model, the second model the
conditional logit model. For the case of $m = 2$ alternatives, both models boil
down to a binary logit model. Indeed, for the multinomial model we get
$p_{i2} = e^{x_i'\beta_2}/(1 + e^{x_i'\beta_2})$, which is a binary logit model with parameter vector
$\beta = \beta_2$. In the conditional logit model we get for $m = 2$ that

$$p_{i2} = \frac{e^{x_{i2}'\beta}}{e^{x_{i1}'\beta} + e^{x_{i2}'\beta}} = \frac{e^{(x_{i2} - x_{i1})'\beta}}{1 + e^{(x_{i2} - x_{i1})'\beta}},$$

which is a binary logit model with explanatory variables $x_i = x_{i2} - x_{i1}$.
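The two probability formulas in (6.20) and their reduction to a binary logit for m = 2 can be checked numerically. The sketch below is illustrative only: the helper names and all numerical values are assumptions, not taken from the text.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def softmax(scores):
    top = max(scores)  # subtract the maximum to avoid overflow in exp
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def mnl_probs(x, betas):
    """Multinomial logit: betas holds beta_2..beta_m; beta_1 = 0 (reference)."""
    return softmax([0.0] + [dot(x, b) for b in betas])


def cl_probs(xs, beta):
    """Conditional logit: xs[j] holds the attributes of alternative j + 1."""
    return softmax([dot(xj, beta) for xj in xs])


def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))


# Hypothetical values, m = 2 alternatives in both models.
x = [1.0, 0.5]
beta2 = [-0.3, 0.8]
p_mnl = mnl_probs(x, [beta2])

xs = [[2.0, 1.0], [0.5, 3.0]]
beta = [0.4, -0.2]
p_cl = cl_probs(xs, beta)
diff = [a - b for a, b in zip(xs[1], xs[0])]  # x_i2 - x_i1

print(p_mnl[1], logistic(dot(x, beta2)))
print(p_cl[1], logistic(dot(diff, beta)))
```

Each printed pair coincides, which is the binary logit reduction stated above.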


Estimation  of  the  multinomial  logit  model 

The multinomial logit (MNL) model can be estimated by maximum likelihood,
that is, by maximizing (6.19) with respect to the parameters $\beta_j$, $j = 2, \ldots, m$. It is
left as an exercise (see Exercise 6.3) to show the following results. If we substitute
(6.20) in (6.19), the log-likelihood becomes

$$\log(L_{MNL}(\beta_2, \ldots, \beta_m)) = \sum_{i=1}^{n} \left( \sum_{j=2}^{m} y_{ij} x_i'\beta_j - \log\Big(1 + \sum_{h=2}^{m} e^{x_i'\beta_h}\Big) \right). \qquad (6.21)$$

The gradient of the log-likelihood consists of the $(m-1)$ stacked $k \times 1$ vectors

$$\frac{\partial \log(L_{MNL})}{\partial \beta_h} = \sum_{i=1}^{n} (y_{ih} - p_{ih}) x_i, \qquad h = 2, \ldots, m,$$

with $p_{ih}$ as specified above for the multinomial model. Further, the
$(m-1)k \times (m-1)k$ Hessian matrix is negative definite with $k \times k$ blocks
$-\sum_{i=1}^{n} p_{ih}(1 - p_{ih}) x_i x_i'$ on the diagonal ($h = 2, \ldots, m$) and $k \times k$ blocks
$\sum_{i=1}^{n} p_{ih} p_{ig} x_i x_i'$ off the diagonal ($g, h = 2, \ldots, m$, $g \neq h$).
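As an illustration of the log-likelihood (6.21) and its gradient, the following sketch evaluates both on a small made-up data set and compares one gradient element with a finite-difference approximation. The data, the parameter values, and the function names are hypothetical.

```python
import math


def mnl_probs(x, betas):
    """Probabilities (6.20); betas holds beta_2..beta_m, beta_1 = 0."""
    scores = [0.0] + [sum(a * b for a, b in zip(x, beta)) for beta in betas]
    top = max(scores)
    w = [math.exp(s - top) for s in scores]
    total = sum(w)
    return [v / total for v in w]


def mnl_loglik(X, y, betas):
    """Log-likelihood (6.21); y[i] is the chosen alternative in 1..m."""
    return sum(math.log(mnl_probs(x, betas)[yi - 1]) for x, yi in zip(X, y))


def mnl_gradient(X, y, betas):
    """Row h-2 equals sum_i (y_ih - p_ih) x_i for h = 2..m."""
    k = len(X[0])
    grad = [[0.0] * k for _ in betas]
    for x, yi in zip(X, y):
        p = mnl_probs(x, betas)
        for h in range(2, len(betas) + 2):
            resid = (1.0 if yi == h else 0.0) - p[h - 1]
            for l in range(k):
                grad[h - 2][l] += resid * x[l]
    return grad


# Hypothetical toy data: constant plus one regressor, m = 3 alternatives.
X = [[1.0, 0.2], [1.0, 1.5], [1.0, -0.7], [1.0, 2.1], [1.0, 0.9]]
y = [1, 3, 2, 3, 1]
betas = [[0.1, -0.4], [-0.2, 0.6]]

grad = mnl_gradient(X, y, betas)

# Central finite-difference check of the element for beta_3, second regressor.
step = 1e-6
up = [row[:] for row in betas]
down = [row[:] for row in betas]
up[1][1] += step
down[1][1] -= step
numeric = (mnl_loglik(X, y, up) - mnl_loglik(X, y, down)) / (2 * step)
print(grad[1][1], numeric)
```

The analytical and numerical derivatives agree up to the approximation error of the finite difference.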


Estimation  of  the  conditional  logit  model 

For the conditional logit (CL) model the results are as follows (see Exercise 6.3).
The log-likelihood is given by

$$\log(L_{CL}(\beta)) = \sum_{i=1}^{n} \left( \sum_{j=1}^{m} y_{ij} x_{ij}'\beta - \log\Big(\sum_{h=1}^{m} e^{x_{ih}'\beta}\Big) \right). \qquad (6.22)$$

The gradient of the log-likelihood is

$$\frac{\partial \log(L_{CL})}{\partial \beta} = \sum_{i=1}^{n} \sum_{j=1}^{m} (y_{ij} - p_{ij}) x_{ij}.$$

Finally, the Hessian is $-\sum_{i=1}^{n} \sum_{j=1}^{m} p_{ij} x_{ij} \big(x_{ij} - \sum_{h=1}^{m} p_{ih} x_{ih}\big)'$.


Numerical aspects

The first order conditions for a maximum can be solved numerically, for
instance, by using the above expressions for the gradient and the Hessian in the
Newton-Raphson algorithm. In both the multinomial and the conditional logit
model the Hessian matrix is negative definite, so that in general the iterations
converge relatively fast to the global maximum. As usual, approximate
standard errors of the ML estimates can be obtained as the square roots of the
diagonal elements of the inverse of the negative Hessian matrix, evaluated at
the ML estimates.
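A minimal sketch of the Newton-Raphson iterations is given below for the simplest possible case, a conditional logit model with a single explanatory variable, so that the gradient and the Hessian are scalars. The data set is made up for illustration, and the function names, starting value, and tolerance are assumptions.

```python
import math


def cl_probs(x_row, beta):
    scores = [beta * x for x in x_row]
    top = max(scores)
    w = [math.exp(s - top) for s in scores]
    total = sum(w)
    return [v / total for v in w]


def gradient_and_hessian(xs, chosen, beta):
    g, H = 0.0, 0.0
    for x_row, c in zip(xs, chosen):
        p = cl_probs(x_row, beta)
        xbar = sum(pj * xj for pj, xj in zip(p, x_row))
        g += x_row[c] - xbar  # equals sum_j (y_ij - p_ij) x_ij
        H -= sum(pj * xj * (xj - xbar) for pj, xj in zip(p, x_row))
    return g, H


def newton_raphson(xs, chosen, beta=0.0, tol=1e-10, max_iter=50):
    for _ in range(max_iter):
        g, H = gradient_and_hessian(xs, chosen, beta)
        if abs(g) < tol:
            break
        beta -= g / H  # Newton step: beta_new = beta - H^{-1} g
    g, H = gradient_and_hessian(xs, chosen, beta)
    return beta, g, H


# Hypothetical data: six individuals, three alternatives, one attribute each.
xs = [[1.0, 2.0, 3.0], [2.0, 1.0, 0.5], [0.0, 1.0, 2.0],
      [1.5, 0.5, 1.0], [3.0, 2.0, 1.0], [1.0, 1.0, 2.5]]
chosen = [2, 0, 1, 0, 1, 2]  # zero-based index of the chosen alternative

beta_hat, g_hat, H_hat = newton_raphson(xs, chosen)
std_error = math.sqrt(-1.0 / H_hat)  # from the inverse of the negative Hessian
print(beta_hat, g_hat, H_hat, std_error)
```

With more than one explanatory variable the same update applies with a vector gradient and a matrix Hessian, and the division is replaced by solving a linear system.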


Marginal  effects  of  explanatory  variables 

The parameters of the model can be interpreted in terms of the marginal
effects of the explanatory variables on the choice probabilities. The following
results are left as an exercise (see Exercise 6.3). In the multinomial logit
model, the $k \times 1$ vector of marginal effects is given by

$$\frac{\partial P_{MNL}[y_i = j]}{\partial x_i} = p_{ij} \Big( \beta_j - \sum_{h=2}^{m} p_{ih} \beta_h \Big). \qquad (6.23)$$

In the conditional logit model the marginal effects are

$$\frac{\partial P_{CL}[y_i = j]}{\partial x_{ij}} = p_{ij}(1 - p_{ij}) \beta, \qquad \frac{\partial P_{CL}[y_i = j]}{\partial x_{ih}} = -p_{ij} p_{ih} \beta \quad \text{for } h \neq j.$$

Note that, in the multinomial logit model, all the parameters $\beta_h$,
$h = 2, \ldots, m$, together determine the marginal effect of $x_i$ on the probability
to choose the $j$th alternative. It may even be the case that the marginal effect
of the $l$th variable $x_{li}$ on $P[y_i = j]$ has the opposite sign of the parameter $\beta_{jl}$. So
the sign of the parameter $\beta_{jl}$ cannot always be interpreted directly as the sign
of the effect of the $l$th explanatory variable on the probability to choose the
$j$th alternative. Therefore the individual parameters of a multinomial logit
model do not always have an easy direct interpretation. On the other hand, in
the conditional logit model the sign of $\beta_l$ is equal to the sign of the marginal
effect of the $l$th explanatory variable ($x_{ij,l}$) on the probability to choose each
alternative since $0 < p_{ij}(1 - p_{ij}) < 1$.
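The multinomial result (6.23) follows from differentiating (6.20) with the quotient rule. A sketch of the steps, using the convention $\beta_1 = 0$ so that the reference category drops out of the last sum:

```latex
\frac{\partial p_{ij}}{\partial x_i}
  = \frac{e^{x_i'\beta_j}\,\beta_j}{\sum_{h=1}^{m} e^{x_i'\beta_h}}
    - \frac{e^{x_i'\beta_j}\,\sum_{h=1}^{m} e^{x_i'\beta_h}\beta_h}
           {\left(\sum_{h=1}^{m} e^{x_i'\beta_h}\right)^{2}}
  = p_{ij}\,\beta_j - p_{ij}\sum_{h=1}^{m} p_{ih}\,\beta_h
  = p_{ij}\Big(\beta_j - \sum_{h=2}^{m} p_{ih}\,\beta_h\Big).
```

Summing over $j = 1, \ldots, m$ gives zero, as it should, because the probabilities always add up to one. The term $\sum_{h} p_{ih}\beta_h$ is a probability-weighted average of the parameter vectors, so the effect on $p_{ij}$ is positive for the $l$th variable only if $\beta_{jl}$ exceeds this average, which explains why the sign of $\beta_{jl}$ alone is not decisive.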


Odds  ratios  and  the  ‘independence  of  irrelevant  alternatives’ 

The above multinomial and conditional logit models are based on the
assumption that the error terms $\varepsilon_{ij}$ are independent not only among different
individuals $i$ but also among the different alternatives $j$. That is, the
unmodelled individual preferences $\varepsilon_{ij}$ of a given individual $i$ are independent for
the different alternatives $j$. This requires that the alternatives should be
sufficiently different from each other. This can be further clarified by
considering the log-odds between two alternatives $j$ and $h$. In the multinomial
and conditional logit model, the log-odds for alternatives $j$ and $h$ is given
respectively by

$$\log\left(\frac{P_{MNL}[y_i = j]}{P_{MNL}[y_i = h]}\right) = x_i'(\beta_j - \beta_h), \qquad \log\left(\frac{P_{CL}[y_i = j]}{P_{CL}[y_i = h]}\right) = (x_{ij} - x_{ih})'\beta.$$

So the relative odds to choose between the alternatives $j$ and $h$ is not affected
by the other alternatives. This property of the multinomial and conditional
logit model is called the ‘independence of irrelevant alternatives’. That is, in
comparing the alternatives $j$ and $h$, the other options are irrelevant. As an
example, suppose that consumers can choose between ten brands of a certain
product, with two strong leading brands ($j = 1, 2$) and with eight other much
smaller brands. Suppose that the owner of the first leading brand is interested
in the odds of his product compared with the other leading brand, that is,
in $\log(P[y_i = 1]/P[y_i = 2])$. Clearly, it should make little difference whether
this is modelled as a choice between ten alternative brands or as a choice
between three alternatives (the two leading brands and the rest, taken as one
category). In such situations the ‘independence of irrelevant alternatives’ is a
reasonable assumption. The odds ratio between two alternatives then does
not change when other alternatives are added to or deleted from the model.
In other situations, especially if some of the alternatives are very similar, the
independence of irrelevant alternatives is not realistic, so that the discussed
logit models are not appropriate. In this case it is better to use multinomial or
conditional probit models to incorporate the dependencies between the error
terms for the different alternatives $j$.
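The independence of irrelevant alternatives can be verified numerically. In the sketch below (hypothetical coefficients and attribute values, illustrative function name), the log-odds between the first two alternatives in a conditional logit model is computed with and without a third alternative in the choice set.

```python
import math


def cl_probs(xs, beta):
    """Conditional logit probabilities for the alternatives listed in xs."""
    scores = [sum(a * b for a, b in zip(xj, beta)) for xj in xs]
    top = max(scores)
    w = [math.exp(s - top) for s in scores]
    total = sum(w)
    return [v / total for v in w]


beta = [-0.5, 1.2]                          # hypothetical coefficients
xs = [[2.0, 1.0], [3.0, 1.5], [1.0, 0.2]]   # hypothetical attributes, m = 3

p_full = cl_probs(xs, beta)
p_reduced = cl_probs(xs[:2], beta)          # third alternative removed

log_odds_full = math.log(p_full[0] / p_full[1])
log_odds_reduced = math.log(p_reduced[0] / p_reduced[1])
# Formula from the text: (x_i1 - x_i2)' beta.
log_odds_formula = sum((a - b) * c for a, b, c in zip(xs[0], xs[1], beta))
print(log_odds_full, log_odds_reduced, log_odds_formula)
```

The individual probabilities change when the third alternative is removed, but the three log-odds values coincide.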

Diagnostic  tests 

One can apply similar diagnostic checks on multinomial and conditional
models, as discussed before in Section 6.1.4 for binary models. For instance,
the overall significance of the model can again be tested by means of the
likelihood ratio test on the null hypothesis that all parameters are zero. One
can further evaluate the success of classification, for instance, by predicting
that the $i$th individual chooses the alternative $h$ for which $\hat{p}_{ih}$ is maximal.
These predicted choices can be compared with the actual observed choices $y_i$
in an $m \times m$ classification table. Let $n_{jj}$ be the number of individuals for
which $y_i = \hat{y}_i = j$ is predicted correctly, and let $p_{jj} = n_{jj}/n$. Then $h = \sum_{j=1}^{m} p_{jj}$
is the hit rate, that is, the success rate of the model predictions. This may be
compared to random predictions, where for each individual the alternative $j$
is predicted with probability $\hat{p}_j = n_j/n$, the observed fractions in the sample.
The expected hit rate of these random predictions is $q = \sum_{j=1}^{m} \hat{p}_j^2$. The model
provides better-than-random predictions if

$$z = \frac{h - q}{\sqrt{q(1 - q)/n}} = \frac{nh - nq}{\sqrt{nq(1 - q)}}$$

is large enough (larger than 1.645 at 5 per cent significance level). An LR-test
for heteroskedasticity may be performed, for instance, by specifying a model
for the error terms in the utility function (6.16) or (6.17) by
$E[\varepsilon_{ij}^2] = \sigma_j^2$, $j = 2, \ldots, m$, where $E[\varepsilon_{i1}^2] = 1$ is fixed. This allows for the
possibility that the utilities of some of the alternatives are better captured by the
explanatory variables than other ones.


Example  6.4:  Bank  Wages  (continued) 

We  return  to  data  of  employees  of  a  bank  considered  in  earlier  chapters.  The 
jobs  in  the  bank  are  divided  into  three  categories.  One  category  (which  is 
given  the  label  ‘1’)  consists  of  administrative  jobs,  a  second  category  (with 
label ‘2’) of custodial jobs, and a third category (with label ‘3’) of
management jobs. We consider the job category (1, 2, 3) as a nominal variable and we
estimate a multinomial logit model to explain the attained job category in
terms of observed characteristics of the employees. We will discuss (i) the
data  and  the  model,  (ii)  the  estimation  results,  (iii)  an  analysis  of  the  marginal 
effects  of  education,  (iv)  the  average  marginal  effects  of  education,  (v)  the 
predictive  performance  of  the  model,  and  (vi)  the  odds  ratios. 

(i)  The  data  and  the  model 

The dependent variable is the attained job category (1, 2, or 3) of the bank
employee. As there are no women with custodial jobs, we restrict the
attention to the 258 male employees of the bank (a model for all 474 employees of
the bank is left as an exercise; see Exercise 6.14). As explanatory variables
we use the education level ($x_2$, in years) and the variable ‘minority’ ($x_3 = 1$
for minorities and $x_3 = 0$ otherwise). The multinomial logit model
(6.20) for the $m = 3$ job categories has $k = 3$ explanatory variables (the
constant term and $x_2$ and $x_3$). We take the first job category (administration)
as reference category. The model contains in total six parameters, a $3 \times 1$
vector $\beta_2$ for job category 2 (custodial jobs) and a $3 \times 1$ vector $\beta_3$ for job
category 3 (management). For an individual with characteristics $x_i$, the
probabilities for the three job categories are then given by


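$$p_{i1} = \frac{1}{1 + e^{x_i'\beta_2} + e^{x_i'\beta_3}}, \qquad p_{i2} = \frac{e^{x_i'\beta_2}}{1 + e^{x_i'\beta_2} + e^{x_i'\beta_3}}, \qquad p_{i3} = \frac{e^{x_i'\beta_3}}{1 + e^{x_i'\beta_2} + e^{x_i'\beta_3}}.$$

(ii) Estimation results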

The results of the multinomial logit model are in Panel 1 of Exhibit 6.5.
The outcomes show that the minority effect is significant for management
jobs, but not for custodial jobs. The education effect is significant for both
job categories, with a positive coefficient of 1.63 for management jobs
and a negative coefficient of -0.55 for custodial jobs. Note, however, that
these coefficients do not have the interpretation of marginal effects, not
even their signs, see (6.23). The marginal effects are analysed below in
part (iii). Panel 2 of Exhibit 6.5 contains the results of the model without
the variables education and minority. The corresponding LR-test on the joint
significance of education and minority has value LR = 2(-118.7 + 231.3)
= 225.2 (see Exhibit 6.5, Panels 1 and 2). This test corresponds to four
restrictions, and the 5 per cent critical value of the corresponding $\chi^2(4)$
distribution is 9.49, so that the two explanatory variables are clearly jointly
significant.
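The arithmetic of this LR-test can be reproduced from the two log-likelihood values reported in Exhibit 6.5. The sketch below uses the closed-form tail probability of the chi-squared distribution with four degrees of freedom, P[X > x] = exp(-x/2)(1 + x/2). The variable and function names are illustrative.

```python
import math

loglik_unrestricted = -118.7360  # Panel 1: constant, education, minority
loglik_restricted = -231.3446    # Panel 2: constants only

lr = 2.0 * (loglik_unrestricted - loglik_restricted)


def chi2_sf_4df(x):
    """Upper tail probability of the chi-squared(4) distribution."""
    return math.exp(-x / 2.0) * (1.0 + x / 2.0)


p_value = chi2_sf_4df(lr)
tail_at_critical_value = chi2_sf_4df(9.49)
print(lr, p_value, tail_at_critical_value)
```

The tail probability at 9.49 is close to 0.05, which confirms the quoted critical value, and the P-value of the test is practically zero.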


(iii)  Analysis  of  the  marginal  effects  of  education 

In multinomial logit models the sign of the marginal effect of an explanatory
variable is not always the same as the sign of the corresponding coefficient.
We will now analyse the marginal effect of education on the probabilities to
attain a job in the three job categories. The coefficient of education ($x_2$) is
$\beta_{22} = -0.55$ for custodial jobs and $\beta_{32} = 1.63$ for management jobs. For
administrative jobs the coefficient of education is by definition $\beta_{12} = 0$, as
this is the reference category. For an individual with characteristics
$x_i = (1, x_{2i}, x_{3i})'$, the estimated marginal effects of education are obtained
from (6.23), with the following results.

$$\frac{\partial P_{MNL}[y_i = 1]}{\partial x_{2i}} = p_{i1}(0.55 p_{i2} - 1.63 p_{i3}),$$

$$\frac{\partial P_{MNL}[y_i = 2]}{\partial x_{2i}} = p_{i2}(-0.55(1 - p_{i2}) - 1.63 p_{i3}) < 0,$$

$$\frac{\partial P_{MNL}[y_i = 3]}{\partial x_{2i}} = p_{i3}(0.55 p_{i2} + 1.63(1 - p_{i3})) > 0.$$

Here we used the fact that the probabilities $p_{ij}$ satisfy $0 < p_{ij} < 1$. So we
conclude that additional education leads to a lower probability of getting a
custodial job and a higher probability of getting a management job, as could
be expected. The effect on the probability of attaining an administrative job
is positive if and only if $0.55 p_{i2} - 1.63 p_{i3} > 0$, that is, as long as the
probability of a custodial job for this individual is at least $1.63/0.55 \approx 3$
times as large as the probability of a management job. The interpretation is
as follows. If someone is most suited to a custodial job, then additional
education may lead more quickly to a job in administration. On the other
hand, if someone already has some chances of a management job, then
additional education decreases the chance of an administrative job in favour
of a management job.
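The sign switch of the education effect on administrative jobs can be made concrete with the estimates of Panel 1 of Exhibit 6.5. The sketch below evaluates the probabilities and the marginal effects (6.23) for two profiles chosen here purely for illustration: non-minority males with 12 and with 16 years of education. The function names are assumptions.

```python
import math

# Estimated coefficients (constant, EDUC, MINORITY) from Panel 1 of Exhibit 6.5.
beta2 = [4.760717, -0.553399, 0.426952]   # category 2, custodial
beta3 = [-26.01435, 1.633386, -2.109115]  # category 3, management


def job_probs(educ, minority):
    x = [1.0, educ, minority]
    e2 = math.exp(sum(a * b for a, b in zip(x, beta2)))
    e3 = math.exp(sum(a * b for a, b in zip(x, beta3)))
    total = 1.0 + e2 + e3
    return [1.0 / total, e2 / total, e3 / total]


def education_effects(educ, minority):
    """Marginal effects (6.23) of education on the three probabilities."""
    p = job_probs(educ, minority)
    b = [0.0, beta2[1], beta3[1]]  # education coefficients, reference is zero
    avg = sum(pj * bj for pj, bj in zip(p, b))
    return [pj * (bj - avg) for pj, bj in zip(p, b)]


effects_12 = education_effects(12.0, 0.0)
effects_16 = education_effects(16.0, 0.0)
print(job_probs(12.0, 0.0), effects_12)
print(job_probs(16.0, 0.0), effects_16)
```

At 12 years of education a management job is very unlikely, so extra education raises the probability of an administrative job. At 16 years the probability of a management job is already substantial and the effect on administrative jobs is negative.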


(a)

Panel 1: MULTINOMIAL LOGIT
Method: Maximum Likelihood (Marquardt)
Sample: 1 474 IF (GENDER=1)
Included observations: 258
Convergence achieved after 33 iterations

Cat      Variable   Beta    Coefficient   Std. Error   z-Statistic   Prob.
Cat 2:   C          B2(1)      4.760717     1.268015      3.754465   0.0002
         EDUC       B2(2)     -0.553399     0.114211     -4.845405   0.0000
         MINORITY   B2(3)      0.426952     0.488181      0.874578   0.3818
Cat 3:   C          B3(1)    -26.01435      2.717261     -9.573738   0.0000
         EDUC       B3(2)      1.633386     0.168697      9.682362   0.0000
         MINORITY   B3(3)     -2.109115     0.636723     -3.312454   0.0009

Log likelihood        -118.7360    Akaike info criterion   0.966946
Avg. log likelihood   -0.460217    Schwarz criterion       1.049573
Number of Coefs.      6

Panel 2: MULTINOMIAL LOGIT
Method: Maximum Likelihood (Marquardt)
Sample: 1 474 IF (GENDER=1)
Included observations: 258
Convergence achieved after 10 iterations

Cat      Variable   Beta    Coefficient   Std. Error   z-Statistic   Prob.
Cat 2:   C          B2(1)     -1.760409     0.208342     -8.449604   0.0000
Cat 3:   C          B3(1)     -0.752181     0.141007     -5.334355   0.0000

Log likelihood        -231.3446    Akaike info criterion   1.808873
Avg. log likelihood   -0.896684    Schwarz criterion       1.836415
Number of Coefs.      2

Panel 3: MARGINAL EFFECTS OF EDUCATION ON PROBABILITIES JOBCAT

                   JOBCAT = 1   JOBCAT = 2   JOBCAT = 3
NON-MINORITIES       -0.127       -0.030        0.157
MINORITIES            0.012       -0.062        0.049

Exhibit 6.5 Bank Wages (Example 6.4)

Multinomial logit model for attained job category of male employees (Panel 1: category 1
(administration) is the reference category, category 2 (custodial jobs) has coefficients B2(1),
B2(2), and B2(3), and category 3 (management) has coefficients B3(1), B3(2), and B3(3)),
multinomial model without explanatory variables (except constant terms for each job
category, Panel 2), and the marginal effects of education on the probability of attaining the three
job categories (Panel 3: the reported numbers are averages over the two subsamples of
non-minority males and minority males).

Panel 4: PREDICTION-REALIZATION TABLE

                                       Actual
                          jobcat = 1   jobcat = 2   jobcat = 3   predicted total
predicted   jobcat = 1       138           14            7            159
            jobcat = 2        10           13            0             23
            jobcat = 3         9            0           67             76
actual total                 157           27           74            258

random hit rate (157/258)^2 + (27/258)^2 + (74/258)^2 = 0.464
Z-value = (218 - 119.6)/sqrt(258 * 0.464 * 0.536) = 12.28, P = 0.0000

(e)  (f)

Prediction-realization table of the predicted and actual job categories for the multinomial
model of Panel 1 (Panel 4), and relation between the logarithm of the odds ratio (on the vertical
axis) against education (on the horizontal axis) for non-minority males (e) and for minority
males (f).


(iv)  Average  marginal  effects  of  education 

Panel  3  of  Exhibit  6.5  shows  the  average  marginal  effect  of  education  on  the 
probabilities  of  having  a  job  in  each  of  the  three  categories.  The  estimated 
marginal  effects  are  averaged  over  the  relevant  subsamples  of  minority  males 
and  non-minority  males.  With  more  education  the  chance  of  getting  a 
management  job  increases  and  of  getting  a  custodial  job  decreases.  For 
management,  the  effects  are  much  larger  for  non-minority  males  (around 
16  per  cent  more  chance  for  one  additional  year  of  education)  than  for 
minority  males  (around  5  per  cent  more  chance). 

(v)  Predictive  performance 

Panel 4 of Exhibit 6.5 shows actual against predicted job categories, where
an individual is predicted to have a job in the category with the highest
estimated probability. The predictions are quite successful for jobs in
administration and management but somewhat less so for custodial jobs, as for
around half the people with custodial jobs it is predicted that they will work
in administration. If the estimated probabilities of having a custodial job are
added over all $n = 258$ individuals, then the predicted total number is equal
to 27, but for fourteen individuals in job category 2 it is predicted to be more
likely that they belong to job category 1. The hit rate is equal to
(138 + 13 + 67)/258 = 218/258 = 0.845, whereas the expected hit rate
of random predictions is equal to $(157/258)^2 + (27/258)^2 + (74/258)^2 = 0.464$.
To test the classification success of the model, these hit rates can be
compared by

$$z = (0.845 - 0.464)/\sqrt{0.464(1 - 0.464)/258} = 12.28, \qquad P = 0.0000.$$

This shows that the model indeed provides significantly better predictions
than would be obtained by random predictions.
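The hit rate, the expected hit rate of random predictions, and the z-value can be recomputed directly from the counts in Panel 4 of Exhibit 6.5. In the sketch below the variable names are illustrative and the one-sided P-value uses the standard normal approximation.

```python
import math

# Rows: predicted job category 1..3; columns: actual job category 1..3.
table = [[138, 14, 7],
         [10, 13, 0],
         [9, 0, 67]]

n = sum(sum(row) for row in table)
hits = sum(table[j][j] for j in range(3))
actual_totals = [sum(table[r][c] for r in range(3)) for c in range(3)]

h = hits / n                                    # hit rate of the model
q = sum((nj / n) ** 2 for nj in actual_totals)  # expected random hit rate
z = (h - q) / math.sqrt(q * (1.0 - q) / n)
p_value = 0.5 * math.erfc(z / math.sqrt(2.0))   # one-sided, standard normal
print(n, hits, round(h, 3), round(q, 3), round(z, 2), p_value)
```

The computed values agree with those reported in the text up to rounding.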

(vi)  Odds  ratios 

Exhibit  6.5  gives  the  log-odds  (as  a  function  of  education)  of  job  category  3 
against job categories 1 and 2, for non-minority male employees (e) and for
male  employees  belonging  to  minorities  (f).  The  odds  ratios  are  higher  for 
non-minority  males,  and  the  odds  ratios  are  larger  with  respect  to  category  2 
than  with  respect  to  category  1.  Recall  that  in  the  logit  model  the  log-odds  is 
a  linear  function  of  the  explanatory  variables.  The  odds  ratios  become  very 
large  for  high  levels  of  education.  This  corresponds  to  relatively  large  prob¬ 
abilities  for  a  management  job,  as  could  be  expected. 

Exercises:  T:  6.3;  E:  6.13e,  6.14,  6.15c,  d. 


6.2.3  Ordered  response 
Model  formulation 

In some situations the alternatives can be ordered, for instance, if the
dependent variable measures opinions (degree of agreement or disagreement
with a statement) or rankings (quality of products). Such a variable is
called ordinal, that is, the outcomes are ordered, although their numerical
values have no further meaning. We follow the convention of labelling the $m$
ordered alternatives by integers ranging from 1 to $m$. In the ordered response
model, the outcome $y_i$ is related to the index function

$$y_i^* = x_i'\beta + \varepsilon_i, \qquad E[\varepsilon_i] = 0.$$

The observed outcome of $y_i$ is related to the index $y_i^*$ by means of $(m-1)$
unknown threshold values $\tau_1 < \tau_2 < \cdots < \tau_{m-1}$ in the sense that

$$y_i = 1 \quad \text{if } -\infty < y_i^* \leq \tau_1,$$

$$y_i = j \quad \text{if } \tau_{j-1} < y_i^* \leq \tau_j, \quad j = 2, \ldots, m-1,$$

$$y_i = m \quad \text{if } \tau_{m-1} < y_i^* < \infty.$$

The index $y_i^*$ is not observed, and the measured response is $y_i = j$ if the index
falls between the threshold values $\tau_{j-1}$ and $\tau_j$. The unknown parameters of
this model are $\beta$ and the $(m-1)$ threshold values. The constant term should
be excluded from the explanatory variables $x_i$, as otherwise the threshold
parameters are not identified. When there are only $m = 2$ alternatives, this
gives the binary response model of Section 6.1.1, where the unknown
threshold value plays the role of the unknown constant term. Indeed, for $m = 2$ we
get $P[y_i = 1] = P[y_i^* \leq \tau_1] = P[\varepsilon_i \leq \tau_1 - x_i'\beta] = F(\tau_1 - x_i'\beta)$, where $F$ is the
cumulative distribution of $\varepsilon_i$.

As compared with the multinomial model (6.16) with $U_{ij} = x_i'\beta_j + \varepsilon_{ij}$, the
ordered response model has the advantage that it uses only a single index
function. Whereas the multinomial model contains $(m-1)k$ parameters, the
ordered response model has $k + m - 2$ parameters and this is considerably
less (for $k > 2$) if the number of alternatives $m$ is large.

Marginal  effects  in  ordered  response  models 

Let $F$ be the cumulative distribution function of $\varepsilon_i$, then

$$p_{ij} = P[y_i = j] = P[\tau_{j-1} < y_i^* \leq \tau_j] = P[y_i^* \leq \tau_j] - P[y_i^* \leq \tau_{j-1}]$$
$$\phantom{p_{ij}} = F(\tau_j - x_i'\beta) - F(\tau_{j-1} - x_i'\beta), \qquad j = 1, \ldots, m. \qquad (6.24)$$

Here we use the notation $\tau_0 = -\infty$ and $\tau_m = \infty$, so that
$P[y_i = 1] = F(\tau_1 - x_i'\beta)$ and $P[y_i = m] = 1 - F(\tau_{m-1} - x_i'\beta)$. The marginal effects of
changes in the explanatory variables are given by

$$\frac{\partial p_{ij}}{\partial x_i} = \big( f(\tau_{j-1} - x_i'\beta) - f(\tau_j - x_i'\beta) \big) \beta,$$

where $f$ is the density function of $\varepsilon_i$. When $x_i'\beta$ increases, this leads to larger
values of the index $y_i^*$, so that the outcome of $y_i$ tends to become larger.
The probability of the outcome $y_i = 1$ will decrease, that of $y_i = m$ will
increase, and that of $y_i = j$ for $j = 2, \ldots, m-1$ can increase or decrease, as at the
same time $P[y_i \leq j-1] = F(\tau_{j-1} - x_i'\beta)$ decreases and
$P[y_i \geq j+1] = 1 - F(\tau_j - x_i'\beta)$ increases.

This is illustrated in Exhibit 6.6 for the case of $m = 3$ possible outcomes
and two threshold values for index values 3.5 and 7.5. The left density in the
figure corresponds to $x'\beta = 5$, in which case the probabilities for the
outcomes 1 and 3 are both quite small. The right density in the figure
corresponds to $x'\beta = 7$, in which case the probability of the outcome 1 is nearly
zero. As compared with $x'\beta = 5$, the probability of the outcome 3 for $x'\beta = 7$
has become much larger. And the probability of the outcome 2 has decreased,
because the loss in the right tail (to alternative 3) is larger than the gain in the
left tail (from alternative 1).

Exhibit 6.6 Ordered response

Effect of changes in the x-variables (measured by $y^* = x'\beta + \varepsilon$ on the horizontal axis) on the
choice probabilities ($y = 1$ if $y^* \leq 3.5$, $y = 2$ if $3.5 < y^* \leq 7.5$, and $y = 3$ if $y^* > 7.5$).
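The probabilities behind this illustration can be computed from (6.24). The sketch below assumes standard normal error terms, which is an assumption made here for concreteness (the text does not state the distribution used in Exhibit 6.6). The function names are illustrative.

```python
import math


def normal_cdf(t):
    return 0.5 * math.erfc(-t / math.sqrt(2.0))


def ordered_probs(index, thresholds, cdf=normal_cdf):
    """p_j = F(tau_j - index) - F(tau_{j-1} - index), tau_0 = -inf, tau_m = inf."""
    cumulative = [cdf(t - index) for t in thresholds] + [1.0]
    probs, previous = [], 0.0
    for c in cumulative:
        probs.append(c - previous)
        previous = c
    return probs


p_at_5 = ordered_probs(5.0, [3.5, 7.5])
p_at_7 = ordered_probs(7.0, [3.5, 7.5])
print([round(p, 4) for p in p_at_5])
print([round(p, 4) for p in p_at_7])
```

Under this assumption the probability of outcome 2 falls from about 0.93 to about 0.69 when the index rises from 5 to 7, while that of outcome 3 rises from below 0.01 to about 0.31.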

Estimation  of  ordered  logit  and  probit  models 

The parameters in an ordered response model can be estimated by maximum
likelihood. The log-likelihood is

$$\log(L(\beta, \tau_1, \ldots, \tau_{m-1})) = \sum_{i=1}^{n} \sum_{j=1}^{m} y_{ij} \log(p_{ij}) = \sum_{i=1}^{n} \log(p_{i y_i}),$$

with $p_{ij}$ as defined in (6.24) and with $y_{ij} = 1$ if $y_i = j$ and $y_{ij} = 0$ if $y_i \neq j$. The
function $F$ should be specified, and in practice one often takes the standard
normal or the logistic distribution. These are called the ordered probit and
the ordered logit model respectively.
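A minimal sketch of this log-likelihood is given below, with the distribution function passed as an argument so that the same code covers the ordered logit and the ordered probit model. The toy data, parameter values, and function names are hypothetical.

```python
import math


def logistic_cdf(t):
    return 1.0 / (1.0 + math.exp(-t))


def normal_cdf(t):
    return 0.5 * math.erfc(-t / math.sqrt(2.0))


def ordered_loglik(beta, taus, X, y, cdf):
    """Sum over i of log p_{i, y_i}, with p_ij as in (6.24).

    y[i] takes values 1..m, taus holds the m - 1 increasing thresholds,
    and the rows of X contain no constant term.
    """
    m = len(taus) + 1
    total = 0.0
    for x, yi in zip(X, y):
        index = sum(a * b for a, b in zip(x, beta))
        upper = 1.0 if yi == m else cdf(taus[yi - 1] - index)
        lower = 0.0 if yi == 1 else cdf(taus[yi - 2] - index)
        total += math.log(upper - lower)
    return total


# Hypothetical toy data: one regressor, m = 3 ordered outcomes.
X = [[0.5], [1.2], [2.0], [2.8], [3.5]]
y = [1, 2, 2, 3, 3]

ll_logit = ordered_loglik([1.0], [1.0, 2.5], X, y, logistic_cdf)
ll_probit = ordered_loglik([1.0], [1.0, 2.5], X, y, normal_cdf)
print(ll_logit, ll_probit)
```

Maximizing this function over the coefficients and the thresholds (subject to the thresholds being increasing) gives the ML estimates. The optimization step itself is not shown here.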

Diagnostic  tests 

Diagnostic tests for the ordered response model are similar to those for the
multinomial and conditional models discussed in Section 6.2.2. The joint
significance of the explanatory variables $x_i$ can be tested by the LR-test for
the hypothesis that $\beta = 0$. The predictive quality can be evaluated by a
classification table. If an alternative is relatively rarely chosen in the sample,
then the corresponding estimated threshold values will have large standard
errors. Such an alternative can sometimes be better combined with
neighbouring alternatives.

Additional diagnostic tests can be obtained by dividing the ordered
alternatives into two groups and by applying the diagnostic tools discussed in
Section 6.1.4 on the resulting two (groups of) alternatives.

Example  6.5:  Bank  Wages  (continued) 

We  continue  the  analysis  in  Example  6.4  concerning  the  three  job  categories 
of  male  employees  of  a  bank.  We  will  now  treat  the  job  category  as  an  ordinal 
variable  and  we  discuss  (i)  the  ordering  of  the  three  job  categories,  (ii)  the 
outcomes  of  the  ordered  logit  model,  (iii)  the  outcomes  of  the  ordered  probit 
model,  and  (iv)  the  effect  of  additional  education. 

(i)  Ordering  of  the  three  job  categories 

Instead of the multinomial logit model estimated in Example 6.4, we now
consider an ordered logit model. The three alternative job categories are
ordered so that $y_i = 1$ for custodial jobs (the former ‘second’ category),
$y_i = 2$ for administrative jobs (the former ‘first’ category), and $y_i = 3$ for
management jobs (the former ‘third’ category). This ordering is chosen as
this corresponds to increasing average wages.

(ii)  Outcomes  of  the  ordered  logit  model 

The estimation results are in Panel 1 of Exhibit 6.7. The ML estimates show a
positive effect (0.87) of education and a negative effect (-1.06) of minority.
The LR-test (LR = 202 with P = 0.0000) shows that the variables are jointly
significant. Exhibit 6.7, Panel 2, is a classification table comparing actual and
predicted job categories. The predictions are quite successful on average,
although too many employees are predicted to have a job in administration
(189 instead of 157) and too few in the other two categories. The estimated
probabilities sum up to the actual numbers of individuals in the three
categories. This always holds true for logit models.

(iii) Outcomes of the ordered probit model

For  comparison,  Panels  3  and  4  of  Exhibit  6.7  show  the  results  of  the  ordered 
probit  model.  The  estimates  are  in  line  with  those  obtained  for  the  ordered 
logit  model,  taking  into  account  the  scaling  factor  of  around  1.6  to  compare 
probit  with  logit  estimates.  The  classifications  in  Panel  4  are  also  very  similar 
to  those  of  the  ordered  logit  model  in  Panel  2.  However,  the  estimated 
probabilities  no  longer  sum  up  to  the  actual  totals  per  job  category,  although 
the  differences  are  very  small. 

(iv)  Effect  of  additional  education 

The effect of having 16 instead of 12 years of education (approximately that
of having a university degree) for non-minority males in the probit model is
represented graphically in Exhibit 6.7 (e). The index $y_i^* = x_i'\beta + \varepsilon_i$ is on the
horizontal axis. The left density is for 12 years of education and the right
density for 16 years of education. In the probit model $\varepsilon_i \sim N(0, 1)$, so that


Panel 1: Dependent Variable: ORDERJOBCAT
Method: ML - ORDERED LOGIT; Number of ordered outcomes: 3
Sample(adjusted): 1 472 IF GENDER=1; Included observations: 258
Convergence achieved after 9 iterations

Variable         Coefficient   Std. Error   z-Statistic   Prob.
EDUC                0.870026     0.089099      9.764700   0.0000
MINORITY           -1.056442     0.375384     -2.814296   0.0049

Limit Points
LIMIT_2:C(3)        7.952259     1.004817      7.914141   0.0000
LIMIT_3:C(4)       14.17223      1.429637      9.913163   0.0000

Log likelihood          -130.3198    Akaike info criterion   1.041239
Restr. log likelihood   -231.3446    Schwarz criterion       1.096323
LR statistic (2 df)      202.0495    Probability (LR stat)   0.000000

(b)

Panel 2: Dependent Variable: ORDERJOBCAT
Method: ML - ORDERED LOGIT
Sample(adjusted): 1 472 IF GENDER=1; Included observations: 258
Prediction table for ordered dependent variable

                 Count of obs               Sum of all
Value   Count    with Max Prob    Error    Probabilities    Error
1          27               23        4               27        0
2         157              189      -32              157        0
3          74               46       28               74        0

Exhibit  6.7  Bank  Wages  (Example  6.5) 

Ordered  logit  model  (Panel  1)  for  achieved  job  categories,  ranked  by  the  variable 
ORDERJOBCAT  (with  value  1  for  ‘custodial’  jobs,  2  for  ‘administrative’  jobs,  and  3  for 
‘management’  jobs),  with  classification  table  of  predicted  job  categories  (Panel  2). 
Panel 3: Dependent Variable: ORDERJOBCAT
Method: ML - ORDERED PROBIT; Number of ordered outcomes: 3
Sample(adjusted): 1 472 IF GENDER=1; Included observations: 258
Convergence achieved after 5 iterations

Variable                Coefficient   Std. Error   z-Statistic   Prob.
EDUC                       0.479043     0.046617     10.27624    0.0000
MINORITY                  -0.509259     0.213978     -2.379963   0.0173
Limit Points
LIMIT_2:C(3)               4.443056     0.556591      7.982620   0.0000
LIMIT_3:C(4)               7.843644     0.744473     10.53583    0.0000

Log likelihood            -131.2073     Akaike info criterion    1.048119
Restr. log likelihood     -231.3446     Schwarz criterion        1.103203
LR statistic (2 df)        200.2746     Probability (LR stat)    0.000000

(d) 


Panel 4: Dependent Variable: ORDERJOBCAT
Method: ML - ORDERED PROBIT
Sample(adjusted): 1 472 IF GENDER=1; Included observations: 258
Prediction table for ordered dependent variable

                  Count of obs               Sum of all
Value   Count    with Max Prob    Error    Probabilities    Error
1          27               23        4           27.626   -0.626
2         157              189      -32          156.600    0.400
3          74               46       28           73.773    0.227

(e) 


Exhibit  6.7  (Contd.) 

Ordered  probit  model  (Panel  3)  for  achieved  job  categories,  ranked  by  the  variable 
ORDERJOBCAT  (with  value  1  for  ‘custodial’  jobs,  2  for  ‘administrative’  jobs,  and  3  for 
‘management’  jobs),  with  classification  table  of  predicted  job  categories  (Panel  4).  (e)  shows 
the  graphs  of  two  probability  distributions  (corresponding  to  the  ordered  probit  model)  for 
non-minority  males,  the  left  one  for  an  education  level  of  12  years  and  the  right  one  for  an 
education  level  of  16  years,  with  the  index  y*  on  the  horizontal  axis  (the  limit  points  4.44  and 
7.84  for  the  three  job  categories  are  taken  from  Panel  3). 




$y_i^* \sim N(x_i'\beta, 1)$, that is, both densities are normal with standard deviation 1,
the left density has mean $x_i'\beta = 0.479 \cdot 12 - 0.509 \cdot 0 = 5.75$ and the right
density has mean $x_i'\beta = 0.479 \cdot 16 = 7.66$. The estimated threshold value
between  custodial  and  administrative  jobs  is  4.44  and  that  between 
administrative  and  management  jobs  is  7.84.  For  12  years  of  education  the 
probability  of  having  an  administrative  job  is  by  far  the  largest,  whereas  for 
16  years  of  education  the  probabilities  of  having  a  management  job  or  an 
administrative  job  are  nearly  equally  large.  In  Exercise  6.13  some  further 
aspects  of  these  data  are  investigated  where  we  also  consider  a  binary  logit 
model  obtained  by  joining  administrative  and  custodial  jobs  into  a  single 
category. 
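The probabilities described above can be reproduced from the estimates in Panel 3. The following is a minimal sketch, not the software output of the exhibit: the function names are illustrative, and the category probabilities are computed as $P[y_i = 1] = \Phi(\tau_1 - x_i'\beta)$, $P[y_i = 3] = 1 - \Phi(\tau_2 - x_i'\beta)$, with the remainder assigned to category 2, where $\tau_1$ and $\tau_2$ denote the two limit points.

```python
import math

def norm_cdf(u):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

# Ordered probit estimates taken from Panel 3 of Exhibit 6.7.
B_EDUC, B_MINORITY = 0.479043, -0.509259
LIMIT_2, LIMIT_3 = 4.443056, 7.843644

def category_probs(educ, minority):
    """Return (P[custodial], P[administrative], P[management])."""
    index = B_EDUC * educ + B_MINORITY * minority   # x'beta
    p_custodial = norm_cdf(LIMIT_2 - index)
    p_management = 1.0 - norm_cdf(LIMIT_3 - index)
    return p_custodial, 1.0 - p_custodial - p_management, p_management

probs_12 = category_probs(12, 0)   # non-minority male, 12 years of education
probs_16 = category_probs(16, 0)   # non-minority male, 16 years of education
print([round(p, 3) for p in probs_12])
print([round(p, 3) for p in probs_16])
```

With these estimates the administrative category has a probability of roughly 0.89 at 12 years of education, whereas at 16 years the administrative and management categories have probabilities of roughly 0.57 and 0.43.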

Exercises:  E:  6.13f,  6.15a. 


6.2.4  Summary 

We  summarize  the  steps  to  model  the  underlying  factors  that  influence  the 
outcome  of  a  multinomial  dependent  variable.  Note  that,  if  the  dependent 
variable  is  a  quantitative  variable  (so  that  the  outcome  values  are  not 
simply  qualitative  labels  but  the  actual  quantitative  measurement  of 
some  quantity  of  interest),  we  should  not  use  the  methods  discussed  in 
this  section,  as  regression-based  methods  may  be  more  informative.  This  is 
discussed  in  Section  6.3,  for  instance,  for  quantitative  dependent  variables 
that  take  only  non-negative  values. 

•  Determine  whether  the  dependent  multinomial  variable  is  a  nominal 
variable  (without  a  natural  ordering  of  the  outcomes)  or  an  ordinal 
variable  (with  a  natural  ordering). 

•  Determine  the  possibly  relevant  explanatory  variables. 

•  For a nominal dependent variable, one can formulate either a multinomial
model (if no characteristics of the alternative choices are measured)
or a conditional model (in case the characteristics of the
alternatives can be measured for each individual).

•  Multinomial and conditional logit models are easily estimated by maximum
likelihood. However, the use of the logit model requires that the
alternatives are sufficiently distinct from each other (the ‘independence
of irrelevant alternatives’). Otherwise one can estimate a multinomial or
conditional probit model by maximum likelihood, at the expense of
more involved numerical integration and optimization techniques.

•  For an ordinal dependent variable, one can estimate an ordered logit or
ordered probit model. As these models are easier to estimate and interpret
than multinomial logit and probit models, it is advantageous to
exploit the ordered nature of the outcomes of the dependent variable.

•  All mentioned models can be evaluated in a similar way as binary
dependent variable models: for instance, by testing the individual
and joint significance of parameters, by determining (mean) marginal
effects of the explanatory variables, by plotting odds ratios, by evaluating
the predictive performance, and so on. Some care is needed in
interpreting individual coefficients, especially in multinomial logit
models, and one often gets a better interpretation of the model by
computing the (mean) marginal effects instead.
6.3  Limited dependent variables


6.3.1  Truncated  samples 

Uses  Chapters  1-4;  Section  5.6;  Section  6.1. 


Different  types  of  limited  dependent  variables 

In the two foregoing sections we considered models for qualitative, discrete-valued
(nominal or ordinal) dependent variables. Now we will investigate
models for limited dependent variables, that is, quantitative, continuous-valued
variables with outcomes that are restricted in some way. So the
dependent variable is an interval variable, in the sense that the numerical
differences in observed values have a quantitative meaning. However, in
contrast with the regression model discussed in Chapters 2-5, the dependent
variable cannot take any arbitrary real value, as there exist some restrictions
on the possible outcomes. We analyse four types of limited dependent variables.
In this section we consider truncated samples where the observations
can be obtained only from a limited part of the underlying population. If the
selection mechanism can be modelled in some way, one can employ the
methods discussed in Section 6.3.3. Section 6.3.2 treats models for censored
data where the possible observed outcomes are limited to an interval. A
special case consists of duration data, discussed in Section 6.3.4. Although
least squares is not appropriate for these types of data, in Sections 6.3.1 and
6.3.2 we will pay detailed attention to the properties of least squares estimators
as this analysis provides suggestions for better estimation methods.

Truncated  observations 

Suppose that the dependent variable $y_i$ and the independent variables $x_i$ are
related by $y_i = x_i'\beta + \varepsilon_i$. A sample is called truncated if we know beforehand
that the observations can come only from a restricted part of the underlying
population distribution.

For instance, suppose that the data concern the purchases of new cars, with
$y_i$ the price of the car and $x_i$ characteristics of the buyer like age and income
class. Then no observations on $y_i$ can be below the price of the cheapest new
car. Some households may want to buy a new car but find it too expensive, in
which case they do not purchase a new car and are not part of the observed
data. This truncation effect should be taken into account, for instance, if one
wants to predict the potential sales of a cheaper new type of car, because most
potential buyers will not be part of the observed sample.

A  model  for  truncated  data 

The truncation can be from below (as in the above example of prices of new
cars), from above (so that $y_i$ cannot take values above a certain threshold), or
from both sides. We will consider the situation that the truncation is from
below with known truncation point. Other types of truncation can be treated
in a similar way (see Exercise 6.5). We further assume that the truncation
point is equal to zero, which can always be achieved by measuring $y_i$ in
deviation from the known truncation point. It is assumed that, in the
untruncated population, the relation between the dependent variable ($y_i^*$)
and the explanatory variables ($x_i$) is linear. For later purposes it is convenient
to write the model as

$$y_i^* = x_i'\beta + \sigma\varepsilon_i, \qquad \varepsilon_i \sim \text{IID}, \qquad E[\varepsilon_i] = 0.$$

Here $\sigma$ is a scale parameter and $\varepsilon_i$ is an error term with known symmetric and
continuous density function $f$. For example, if $\varepsilon_i$ follows the standard normal
distribution, then $\sigma$ is the unknown standard deviation of the error terms. The
above formulation of the regression equation differs from the usual one in
Chapters 2-5, as it explicitly contains the scale factor $\sigma$. This is convenient in
what follows, since by extracting the scale factor $\sigma$ we can now assume that the
density function $f$ of the (normalized) error terms $\varepsilon_i$ is completely known. The
observed data are assumed to satisfy this model, but the sample is truncated in
the sense that individuals with $y_i^* \le 0$ are not observed. In the car sales
example, $y_i^*$ may be interpreted as the amount of money that an individual
wants to spend on a new car, and, if this is less than the price of the cheapest car,
then this individual will not buy a new car. So the sample comes from a
subpopulation, that is,

$$
\begin{aligned}
y_i &= y_i^* = x_i'\beta + \sigma\varepsilon_i && \text{if } y_i^* > 0, \\
y_i &\ \text{is not observed} && \text{if } y_i^* \le 0.
\end{aligned}
\qquad (6.25)
$$

A  graphical  illustration  of  truncation 

The effect of truncation is illustrated graphically in Exhibit 6.8. This corresponds
to the above truncated regression model, with $y_i^* = x_i + \varepsilon_i$ and
$\varepsilon_i \sim \text{NID}(0, 1)$. If $x_i = 1$, then the corresponding value of $y_i$ is observed if
and only if $\varepsilon_i > -1$, that is, the error term comes from the standard normal
distribution truncated on the left at the value $-1$. In general, for a given value
Exhibit 6.8 Truncated data

(a) shows a truncated normal density with truncation from below (at $x = -1$). (b)-(d) show
scatter diagrams illustrating the effect of truncation on the OLS estimates; (b) is the untruncated
scatter of $y^*$ against $x$, (c) is the truncated scatter of $y$ against $x$, and (d) contains the two
regression lines (the DGP has slope $\beta = 1$).


of $x_i$, the density of $\varepsilon_i$ is truncated at the value $-x_i$. The scatter diagrams in
Exhibit 6.8 (b) and (c) illustrate that the truncation effect is large for small
values of $x_i$ and small for large values of $x_i$. For small values of $x_i$ we get
observations only for relatively large values of the disturbances. This means
that, on the left part of the scatter diagram, the observed values tend to lie
above the model relation $y = x$, whereas on the right part the observations
are scattered more symmetrically around this line. This leads, in this model
with a positive slope $\beta = 1$, to a downward bias of the OLS estimator (see
Exhibit 6.8 (d)).
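The downward bias can be illustrated with a small simulation. This is only a sketch under stated assumptions: the data generating process $y_i^* = x_i + \varepsilon_i$ with standard normal errors is the one described above, but the sample size, the seed, and the range of $x_i$ (uniform on $[-1, 3]$) are hypothetical choices and are not taken from Exhibit 6.8.

```python
import random

def ols_slope(xs, ys):
    """Slope of the least squares regression of ys on a constant and xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    return s_xy / s_xx

rng = random.Random(12345)          # fixed seed, hypothetical choice
n = 20000
x = [rng.uniform(-1.0, 3.0) for _ in range(n)]
y_star = [xi + rng.gauss(0.0, 1.0) for xi in x]   # untruncated DGP, slope 1

# Truncation from below at zero: only observations with y* > 0 are kept.
kept = [(xi, yi) for xi, yi in zip(x, y_star) if yi > 0]

slope_full = ols_slope(x, y_star)
slope_trunc = ols_slope([k[0] for k in kept], [k[1] for k in kept])
print(round(slope_full, 3), round(slope_trunc, 3))
```

In such a simulation the slope estimated from the full sample should be close to 1, whereas the slope estimated from the truncated sample should be clearly smaller.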

□  The  truncated  density  function  of  the  error  terms 

We now analyse the effect of truncation more generally for the model (6.25). In
the observed sample there holds $y_i^* > 0$, so that $\varepsilon_i$ comes from the truncation of the
distribution $f$ with $\varepsilon_i > -x_i'\beta/\sigma$. The cumulative distribution of the error term of
the $i$th observation is therefore given by

$$
P[\varepsilon_i \le t \mid \varepsilon_i > -x_i'\beta/\sigma] =
\begin{cases}
0 & \text{if } t \le -x_i'\beta/\sigma, \\[4pt]
\dfrac{P[-x_i'\beta/\sigma < \varepsilon_i \le t]}{P[\varepsilon_i > -x_i'\beta/\sigma]}
= \dfrac{F(t) - F(-x_i'\beta/\sigma)}{F(x_i'\beta/\sigma)} & \text{if } t > -x_i'\beta/\sigma.
\end{cases}
$$


Here $F$ denotes the cumulative distribution corresponding to the density
$f$, and we used the continuity and symmetry of $f$ so that $P[\varepsilon_i > -a] =
P[\varepsilon_i < a] = P[\varepsilon_i \le a] = F(a)$. The density function of the error terms $\varepsilon_i$ of the
data generating process (that is, with observations $y_i > 0$) is obtained by differentiating
the above cumulative distribution function with respect to the argument $t$.
This gives the truncated density function $f_i$ defined by

$$
f_i(t) = 0 \quad \text{for } t \le -x_i'\beta/\sigma, \qquad
f_i(t) = \frac{f(t)}{F(x_i'\beta/\sigma)} \quad \text{for } t > -x_i'\beta/\sigma.
$$

So the truncated density of the error terms is proportional to the ‘right part’ (with
$t > -x_i'\beta/\sigma$) of the original density $f$. The scaling factor $F(x_i'\beta/\sigma)$ is needed to get a
density, that is, with $\int f_i(t)\,dt = 1$.


Derivation  of  the  truncated  density  function  of  the  dependent 
variable 


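The foregoing results for the density function of the error terms $\varepsilon_i$ can be used to
derive the density function $p(y_i)$ of the dependent variable $y_i = x_i'\beta + \sigma\varepsilon_i$. As $y_i$ is
observed, this means that $y_i > 0$, so that the error term comes from the truncated
distribution with $\varepsilon_i > -x_i'\beta/\sigma$. Let $F_{y_i}$ denote the cumulative distribution function of $y_i$; then for
$t > 0$ we get

$$
\begin{aligned}
F_{y_i}(t) &= P[y_i \le t \mid \varepsilon_i > -x_i'\beta/\sigma] = P[x_i'\beta + \sigma\varepsilon_i \le t \mid \varepsilon_i > -x_i'\beta/\sigma] \\
&= P[\varepsilon_i \le (t - x_i'\beta)/\sigma \mid \varepsilon_i > -x_i'\beta/\sigma]
= \frac{F((t - x_i'\beta)/\sigma) - F(-x_i'\beta/\sigma)}{F(x_i'\beta/\sigma)}.
\end{aligned}
$$

The density function is obtained by differentiating with respect to $t$ so that

$$
p(y_i) = \frac{1}{\sigma}\,\frac{f((y_i - x_i'\beta)/\sigma)}{F(x_i'\beta/\sigma)} \qquad (y_i > 0). \qquad (6.26)
$$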


This result can also be obtained by applying (1.10) (p. 22) for the transformation
$y_i = x_i'\beta + \sigma\varepsilon_i = g(\varepsilon_i)$, where $\varepsilon_i$ has density $f(t)/F(x_i'\beta/\sigma)$ for $t > -x_i'\beta/\sigma$.
The inverse transformation is $\varepsilon_i = h(y_i) = (y_i - x_i'\beta)/\sigma$ with derivative
$h'(y_i) = 1/\sigma$. Then (6.26) follows directly from (1.10).
Derivation of systematic bias of OLS

The estimates of $\beta$ obtained by applying OLS to (6.25) (for observations $y_i > 0$
in the sample) are not consistent. Exhibit 6.8 illustrates this graphically.
More formally, inconsistency follows from the fact that the error terms with
distribution $f_i$ do not have zero mean, as $E[\varepsilon_i \mid y_i^* > 0] = E[\varepsilon_i \mid \varepsilon_i > -x_i'\beta/\sigma] > 0$.
For instance, if $f = \phi$ is the standard normal distribution with cumulative distribution
$\Phi$, then (using the notation $z_i = x_i'\beta/\sigma$) we get

$$
\begin{aligned}
E[\varepsilon_i \mid y_i^* > 0] &= \int_{-z_i}^{\infty} t f_i(t)\,dt
= \frac{1}{\Phi(z_i)} \int_{-z_i}^{\infty} t\,\frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}t^2}\,dt
= \frac{1}{\Phi(z_i)}\,\frac{1}{\sqrt{2\pi}} \Big[-e^{-\frac{1}{2}t^2}\Big]_{-z_i}^{\infty} \\
&= \frac{\phi(z_i)}{\Phi(z_i)} = \frac{\phi(x_i'\beta/\sigma)}{\Phi(x_i'\beta/\sigma)} = \lambda_i > 0.
\end{aligned}
\qquad (6.27)
$$


The term $\lambda_i$ is called the inverse Mills ratio, and this expression for the truncation
bias is specific for the normal distribution of the error terms $\varepsilon_i$. For observations in
the sample, the mean value of $y_i = x_i'\beta + \sigma\varepsilon_i$ is not $x_i'\beta$ (as in the untruncated
regression model) but it is

$$E[y_i \mid y_i^* > 0] = x_i'\beta + \sigma E[\varepsilon_i \mid y_i^* > 0] = x_i'\beta + \sigma\lambda_i. \qquad (6.28)$$

Let $\omega_i = y_i - E[y_i \mid y_i^* > 0] = \sigma(\varepsilon_i - \lambda_i)$, then $E[\omega_i \mid y_i^* > 0] = 0$ and in the observed
sample (with $y_i^* > 0$) we can write

$$y_i = x_i'\beta + \sigma\lambda_i + \omega_i, \qquad E[\omega_i] = 0.$$

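If we regress $y_i$ on $x_i$, then the (unobserved) regressor $\lambda_i$ is neglected. This makes
OLS biased and inconsistent (see also Section 3.2.3 (p. 142-3) on omitted variables
bias). Formally, substituting the above expression for $y_i$, the OLS estimator is

$$
b = \Big(\sum_i x_i x_i'\Big)^{-1} \sum_i x_i y_i
= \beta + \sigma \Big(\frac{1}{n}\sum_i x_i x_i'\Big)^{-1} \frac{1}{n}\sum_i \lambda_i x_i
+ \Big(\frac{1}{n}\sum_i x_i x_i'\Big)^{-1} \frac{1}{n}\sum_i x_i \omega_i.
$$

OLS is inconsistent because the probability limit $\text{plim}\big(\frac{1}{n}\sum_i \lambda_i x_i\big) \neq 0$, as $\lambda_i$ is a
function of $x_i$. That is, the orthogonality condition is violated.
The bias of OLS will be small if the terms $\lambda_i$ are small, that is, if the
terms $x_i'\beta/\sigma$ are large (as $\phi(z_i) \to 0$ and $\Phi(z_i) \to 1$ so that $\lambda_i \to 0$ for $z_i \to \infty$). In
this case the truncation has only a small effect, as the condition that $\varepsilon_i > -x_i'\beta/\sigma$ is
then hardly a restriction anymore.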


Estimation  by  maximum  likelihood 

Consistent estimates of $\beta$ are obtained by applying maximum likelihood, using the
correct truncated density functions $f_i$ for the error terms $\varepsilon_i$ and the corresponding
truncated density (6.26) of the observations $y_i$. For the normal distribution, the
corresponding truncated density (6.26) is equal to

$$
p(y_i) = \frac{1}{\sigma}\,\frac{\phi((y_i - x_i'\beta)/\sigma)}{\Phi(x_i'\beta/\sigma)}.
$$

As the observations $y_i$ are assumed to be mutually independent, the log-likelihood
$\log(L(\beta, \sigma)) = \log(p(y_1, \ldots, y_n)) = \sum_{i=1}^{n} \log(p(y_i))$ becomes

$$
\log(L) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - x_i'\beta)^2 - \sum_{i=1}^{n}\log\big(\Phi(x_i'\beta/\sigma)\big). \qquad (6.29)
$$

The last term comes in addition to the usual terms in a linear regression model (see
(4.30) in Section 4.3.2 (p. 227)) and represents the truncation effect. This last term
is non-linear in $\beta$ and $\sigma$, and the first order conditions for a maximum of $\log(L)$
involve the terms $\lambda_i$, so that numerical integration is needed.
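To make the structure of (6.29) concrete, the following sketch evaluates the log-likelihood for given parameter values and reports the usual regression part and the truncation term separately. It is an illustration only: the function name and the toy data are hypothetical, and maximization over $\beta$ and $\sigma$ (which would require a numerical optimizer) is not shown.

```python
import math

def norm_cdf(u):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def loglik_parts(beta, sigma, xs, ys):
    """Return (regression_part, truncation_part) of the log-likelihood (6.29).

    xs is a list of regressor vectors x_i, ys the observed values y_i > 0.
    """
    n = len(ys)
    regression_part = (-0.5 * n * math.log(2.0 * math.pi)
                       - 0.5 * n * math.log(sigma ** 2))
    truncation_part = 0.0
    for x_i, y_i in zip(xs, ys):
        index = sum(b * v for b, v in zip(beta, x_i))      # x_i' beta
        regression_part -= (y_i - index) ** 2 / (2.0 * sigma ** 2)
        truncation_part -= math.log(norm_cdf(index / sigma))
    return regression_part, truncation_part

# Hypothetical toy data: a constant and one regressor, three observations.
xs = [[1.0, 0.5], [1.0, 1.0], [1.0, 2.0]]
ys = [0.8, 1.4, 2.1]
regression_part, truncation_part = loglik_parts([0.2, 1.0], 1.0, xs, ys)
log_likelihood = regression_part + truncation_part
print(regression_part, truncation_part, log_likelihood)
```

The truncation term is positive and vanishes when all values $x_i'\beta/\sigma$ are large, in which case the log-likelihood reduces to that of the linear regression model.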


Marginal  effects  in  truncated  models 

Some care is needed in interpreting the parameters $\beta$. They measure the
marginal effects on $E[y]$ of the explanatory variables $x$ in the (untruncated)
population. Therefore they are the parameters of interest for out-of-sample
predictions, that is, to estimate effects for unobserved values $y_i^* \le 0$. If one
is instead interested in within-sample effects, that is, in the truncated
population with $y_i^* > 0$, then for the normal distribution the relevant marginal
effects are

$$
\frac{\partial E[y_i \mid y_i^* > 0]}{\partial x_i} = \big(1 - \lambda_i^2 - \lambda_i x_i'\beta/\sigma\big)\beta. \qquad (6.30)
$$

This measures the effect of each explanatory variable on the expected value
of the response of an individual in the sample. The correction term in front of
$\beta$ in (6.30) lies between zero and one (see Exercise 6.4). So the marginal
effects in the truncated population are closer to zero than those in the
untruncated population. For purposes of interpretation, the averages of
these effects over the sample can be reported. Note that the ratios $\beta_j/\beta_h$
continue to have the interpretation of the relative effect of the $j$th and $h$th
explanatory variables on the dependent variable (as the scalar factor in front
of $\beta$ in (6.30) is the same for all the $k$ elements of the vector of explanatory
variables $x_i$). This implies that these relative effects are the same for the
untruncated and the truncated population.
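The size of the correction term in (6.30) can be inspected numerically. The sketch below, with illustrative function names, evaluates $1 - \lambda^2 - \lambda z$ for a few hypothetical values of $z = x_i'\beta/\sigma$; it does not replace the proof asked for in Exercise 6.4.

```python
import math

def norm_pdf(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def norm_cdf(u):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def inverse_mills(z):
    """Inverse Mills ratio lambda = phi(z) / Phi(z)."""
    return norm_pdf(z) / norm_cdf(z)

def correction_factor(z):
    """Scalar factor in front of beta in (6.30), with z = x'beta / sigma."""
    lam = inverse_mills(z)
    return 1.0 - lam * lam - lam * z

z_values = [-2.0, -1.0, 0.0, 1.0, 2.0, 4.0]   # hypothetical values of z
factors = [correction_factor(z) for z in z_values]
for z, factor in zip(z_values, factors):
    print(z, round(factor, 4))
```

For these values the factor lies strictly between zero and one and increases towards one as $z$ grows, in line with the statement that truncation matters little when $x_i'\beta/\sigma$ is large.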
Example 6.6: Direct Marketing for Financial Product (continued)

In Example 6.1 we described direct marketing data concerning a new financial
product. Of the 925 customers, 470 responded to the mailing by
investing in the new product. Now we analyse the truncated sample consisting
of these 470 customers. We will discuss (i) a truncated model for the
invested amount of money, (ii) results of OLS and ML, and (iii) some
comments on the obtained results.

(i)  A  truncated  model  for  the  invested  amount  of  money 

We relate the amount of invested money to characteristics of the customer,
that is, gender (1 for males, 0 for females), activity (1 if the customer already
invests in other products of the bank, 0 otherwise), and age (including also a
squared term to allow for non-linear effects). We consider the truncated data
set of 470 customers who invested a positive amount of money. As the variable
to be explained we take $y_i = \log(1 + \text{invest})$, where ‘invest’ is the amount of
money invested. We take logarithms because the distribution of the amount
of invested money is very skewed. In the sample the investments are positive,
so that $y_i > 0$. Let $y_i^*$ be the ‘inclination to invest’; then the model is given by
(6.25), where $x_i$ is the $5 \times 1$ vector of explanatory variables (constant, gender,
activity, age, and squared age). This is a truncated regression model.

(ii)  Results  of  OLS  and  ML 

Exhibit 6.9 shows the results of OLS (without taking the truncation into
account, in Panel 1) and of ML in (6.29) (that is, using the truncated normal
density, in Panel 2). The outcomes suggest that the variables ‘gender’ and
‘activity’ do not have significant effects on the amount of invested money and
that age has a significant effect, with a maximum at an age of around 62
years (namely, where $0.0698 - 2 \cdot 0.0559 \cdot (\text{age}/100) = 0$). This is somewhat
surprising, as one would normally expect the variable ‘activity’ to have a
positive effect on the invested amount and the effect of age to be maximal at
an earlier age.

(iii)  Comments  on  the  obtained  results 

The results of OLS and of ML are nearly equal. Exhibit 6.9 (c) and (d) show
the histogram of the values of $z_i = x_i'b/s$, where $b$ and $s$ are the ML estimates
of $\beta$ and $\sigma$, and the histogram of the corresponding values of the inverse Mills
ratio $\lambda_i = \phi(z_i)/\Phi(z_i)$. The $z_i$ values are positive and lie far away from the
truncation point zero, with minimal value 4.06. Consequently, the values of
$\lambda_i$ are very close to zero, with maximal value 0.0001. This means that the
correction term in the log-likelihood (6.29) is very small (for $z_i \ge 4.06$ there
holds $\Phi(z_i) \approx 1$ and hence $\log(\Phi(z_i)) \approx 0$). This explains why ML and OLS
are nearly equivalent for these data. As the estimates by OLS (neglecting the
truncation) and ML (with truncation) are close together, it is tempting to
Panel 1: Dependent Variable: LOGINV = LOG(1 + INVEST)
Method: Least Squares
Sample(adjusted): 1 1000 IF INVEST>0
Included observations: 470

Variable       Coefficient   Std. Error   t-Statistic   Prob.
C                 2.854243     0.611331      4.668900   0.0000
GENDER           -0.214029     0.115786     -1.848499   0.0652
ACTIVITY         -0.132122     0.099673     -1.325556   0.1856
AGE               0.069782     0.024903      2.802193   0.0053
AGE^2/100        -0.055928     0.024156     -2.315258   0.0210

R-squared         0.047838     S.E. of regression       0.944256

Panel 2: Dependent Variable: LOGINVEST = LOG(1 + INVEST)
Method: ML - Truncated Normal
Sample(adjusted): 1 1000 IF INVEST>0
Included observations: 470; truncated sample with left censoring value 0
Convergence achieved after 11 iterations

Variable       Coefficient   Std. Error   z-Statistic   Prob.
C                 2.854157     0.608105      4.693525   0.0000
GENDER           -0.214033     0.115170     -1.858412   0.0631
ACTIVITY         -0.132126     0.099144     -1.332676   0.1826
AGE               0.069785     0.024771      2.817203   0.0048
AGE^2/100        -0.055931     0.024029     -2.327683   0.0199
Error Distribution
SCALE: SIGMA      0.939228     0.030637     30.65627    0.0000

R-squared         0.047838     S.E. of regression       0.945272

(c) [Histogram of the series ZVALUE, sample 1 1000, 470 observations: mean 4.955041, median 4.987214, maximum 5.356398, minimum 4.064282]

(d) [Histogram of the inverse Mills ratio; horizontal axis ticks at 0.00000, 0.00004, 0.00008]


Exhibit  6.9  Direct  Marketing  for  Financial  Product  (Example  6.6) 


Models for invested amount of money based on data of 470 individuals who made an investment,
OLS (Panel 1) and ML in truncated model (Panel 2). (c) and (d) show histograms of the
values of $z = x'b_{ML}/s_{ML}$ (in (c)) and of the inverse Mills ratio $\lambda = \phi(z)/\Phi(z)$ (in (d)).


conclude that the truncation has no serious effects. However, this need not be
correct. To investigate these effects in a proper way we need further information
on the individuals who are excluded from the sample, here the
customers who did not invest. In the next section we consider the situation
where we also know the characteristics of the individuals who did not invest.
As we will see in Example 6.7, we then get quite different conclusions (for
instance, on the sign of the effect of the variable activity and on the question
at what age the effect is maximal).
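The orders of magnitude quoted in part (iii) can be checked directly from the minimal value of $z_i$ reported in Exhibit 6.9 (c). The following short sketch uses illustrative variable names and only the reported minimum 4.064282; the individual data are not used.

```python
import math

def norm_pdf(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def norm_cdf(u):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

z_min = 4.064282                                  # minimum of ZVALUE in Exhibit 6.9 (c)
lambda_max = norm_pdf(z_min) / norm_cdf(z_min)    # largest inverse Mills ratio
log_cdf_min = math.log(norm_cdf(z_min))           # largest correction in (6.29), in absolute value

print(lambda_max, log_cdf_min)
```

Both numbers are of the order $10^{-4}$ or smaller, so each observation contributes almost nothing to the last term of (6.29).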

Exercises:  T:  6.4a,  b;  S:  6.9a,  b. 


6.3.2  Censored  data 
Tobit  model  for  censored  data 

The dependent variable is called censored when the response cannot take
values below (left censored) or above (right censored) a certain threshold
value. For instance, in the example on investments in a new financial product,
the investments are either zero or positive. And, in deciding about a new
car, one has either to pay the cost of the cheapest car or abstain from buying a
new car. The so-called tobit model relates the observed outcomes of $y_i \ge 0$ to
an index function

$$y_i^* = x_i'\beta + \sigma\varepsilon_i$$

by means of

$$
\begin{aligned}
y_i &= y_i^* = x_i'\beta + \sigma\varepsilon_i && \text{if } y_i^* > 0, \\
y_i &= 0 && \text{if } y_i^* \le 0.
\end{aligned}
\qquad (6.31)
$$

Here $\sigma$ is a scale parameter and the error terms $\varepsilon_i$ have a known symmetric
density function $f$ (so that $f(t) = f(-t)$ for all $t$) with cumulative distribution
function $F$, so that

$$E[\varepsilon_i] = 0.$$

In the tobit model, the functions $f$ and $F$ are usually chosen in accordance with
the standard normal distribution (with $f = \phi$ and $F = \Phi$). The above model
for censored data is sometimes called the tobit type 1 model, to distinguish it
from the tobit type 2 model that will be discussed in the next section for data
with selection effects. In contrast with a truncated sample, where only the
responses for $y_i^* > 0$ are observed, it is now assumed that responses $y_i = 0$
corresponding to $y_i^* \le 0$ are also observed and that the values of $x_i$ for such
observations are also known. In practice these zero-responses are of interest,
as they provide relevant information on economic behaviour. For instance, it
is of interest to know which individuals decided not to invest (as other
financial products could be developed for this group) or which individuals
did not buy a new car (as one could design other cars that appeal more to this
group). The tobit model can be seen as a variation of the probit model, with
one discrete option (‘failure’, $y_i = 0$) and where the option ‘success’ is
replaced by the continuous variable $y_i > 0$.

A  graphical  illustration  of  censoring 

If one simply applies OLS by regressing $y_i$ (including the zero observations)
on $x_i$, then this leads to inconsistent estimators. The reason is the same as in
the case of truncated samples, that is, $E[y_i] \neq x_i'\beta$.

Exhibit 6.10 provides a graphical illustration of the effect of censoring.
Here the data are generated by $y_i^* = x_i + \varepsilon_i$ with $\varepsilon_i \sim N(0, 1)$ and $y_i = y_i^*$ if
$y_i^* > 0$ and $y_i = 0$ if $y_i^* \le 0$. For a given value of $x_i$, the probability distribution
of $y_i$ is mixed continuous-discrete. For instance, for $x_i = 0$, the probability
on the outcome $y_i = 0$ is $P[\varepsilon_i \le 0] = 0.5$, outcomes $y_i > 0$ have a standard
normal density, and outcomes $y_i < 0$ are not possible. Observed values $y_i$
correspond to the uncensored model $y_i = x_i + \varepsilon_i$ if and only if $y_i = y_i^*$, that
is, if and only if $y_i^* = x_i + \varepsilon_i > 0$. Clearly, the condition $\varepsilon_i > -x_i$ is hardly a

Exhibit  6.10  Censored  data 


(a) shows a censored normal density with censoring from below (at $x = 0$), with a point mass
$P[x = 0] = 0.5$. (b)-(d) show scatter diagrams illustrating the effect of censoring on the OLS
estimates: (b) is the uncensored scatter of $y^*$ against $x$, (c) is the censored scatter of $y$ against $x$,
and (d) contains the two regression lines (the DGP has slope $\beta = 1$).
restriction if $x_i$ takes large positive values, but it is a strong restriction if $x_i$
takes large negative values. In Exhibit 6.10 (c) the observed values $y_i$ (with
$y_i \ge 0$ always) for small values of $x_i$ are systematically larger than the
corresponding values of the index $y_i^*$ in (b) (which are often negative for
negative values of $x_i$). This upward bias in the observations $y_i$ in the left part
of the scatter diagram in Exhibit 6.10 (c) leads to a downward bias in the
OLS estimator (see (d)).


Derivation  of  the  distribution  of  a  censored  dependent  variable 

For given values of $x_i$, the distribution of $y_i$ in the tobit model is mixed continuous-discrete,
with continuous density $p_{y_i}(t)$ for outcomes $t > 0$ and with a positive
probability on the discrete outcome $y_i = 0$. We will now derive the explicit
expression for the probability distribution. First we consider the discrete part.
As the density $f$ of the error terms is assumed to be symmetric, it follows that
$F(-t) = 1 - F(t)$, so that

$$P[y_i = 0] = P[\varepsilon_i \le -x_i'\beta/\sigma] = F(-x_i'\beta/\sigma) = 1 - F(x_i'\beta/\sigma).$$

Second, we consider the continuous part for $y_i = t > 0$. For $t > 0$ there holds

$$
F_{y_i}(t) = P[y_i \le t] = P[x_i'\beta + \sigma\varepsilon_i \le t] = P\Big[\varepsilon_i \le \frac{t - x_i'\beta}{\sigma}\Big] = F\Big(\frac{t - x_i'\beta}{\sigma}\Big).
$$

The density $p_{y_i}(t)$ of $y_i > 0$ is the derivative of this expression with respect to $t$,
that is, $(1/\sigma) f((t - x_i'\beta)/\sigma)$. Summarizing the above results, the probability distribution
of a censored variable is equal to

$$
\begin{aligned}
P[y_i = 0] &= 1 - F(x_i'\beta/\sigma), \\
p_{y_i}(t) &= \frac{1}{\sigma} f\Big(\frac{t - x_i'\beta}{\sigma}\Big) \quad \text{for } y_i = t > 0.
\end{aligned}
\qquad (6.32)
$$


Derivation  of  systematic  bias  of  OLS  for  censored  data 

We now investigate the effect of censoring in the model (6.31). In the
standard (uncensored) regression model there holds $x_i'\beta = E[y_i]$, but this does not
hold true for the censored regression model. In this case the model (6.31) implies
(as before, we interpret all expressions conditional on the given values of $x_i$)

$$
\begin{aligned}
E[y_i] &= 0 \cdot P[y_i = 0] + P[y_i > 0]\,E[y_i \mid y_i > 0] \\
&= F(x_i'\beta/\sigma)\big(x_i'\beta + \sigma E[\varepsilon_i \mid y_i > 0]\big).
\end{aligned}
\qquad (6.33)
$$
Here we used $P[y_i > 0] = 1 - P[y_i = 0] = F(x_i'\beta/\sigma)$, see (6.32). In the expression
(6.33) there holds $E[\varepsilon_i \mid y_i > 0] > 0$ (as in the case of truncated samples) and
$0 < F(x_i'\beta/\sigma) < 1$. So in general $E[y_i]$ differs from $x_i'\beta$, and the size
of the bias term $(E[y_i] - x_i'\beta)$ depends on $x_i$; as $y_i \ge y_i^*$ for all observations, this bias term cannot be negative. As an example, let $f = \phi$ be the
standard normal distribution. Then the results in (6.27) and (6.33) imply that, in
this case,


$$E[y_i] = \Phi(x_i'\beta/\sigma)\left(x_i'\beta + \sigma\lambda_i\right) = \Phi_i x_i'\beta + \sigma\phi_i, \tag{6.34}$$

where

$$\lambda_i = \phi(x_i'\beta/\sigma)/\Phi(x_i'\beta/\sigma)$$

is the inverse Mills ratio and we used the shorthand notation $\phi_i = \phi(x_i'\beta/\sigma)$ and
$\Phi_i = \Phi(x_i'\beta/\sigma)$. Now the model (6.31) can be written as the regression equation
$y_i = x_i'\beta + \sigma\eta_i$, where $\eta_i = \varepsilon_i$ if $y_i^* > 0$ (that is, if $\varepsilon_i > -x_i'\beta/\sigma$) and
$\eta_i = -x_i'\beta/\sigma$ if $y_i^* \le 0$ (that is, if $\varepsilon_i \le -x_i'\beta/\sigma$). The result in (6.34)
shows that, for given value of $x_i$,

$$E[\eta_i] = \frac{1}{\sigma}\left(E[y_i] - x_i'\beta\right) = \frac{\Phi_i - 1}{\sigma}\, x_i'\beta + \phi_i.$$

In general, $E[\eta_i] \ne 0$, and the distribution of $\eta_i$ depends on $x_i$, so that OLS is
inconsistent. More precisely, define $\omega_i = \eta_i + \frac{1 - \Phi_i}{\sigma} x_i'\beta - \phi_i$, then
$\eta_i = E[\eta_i] + \omega_i$ with $E[\omega_i] = 0$, and we can write the data generating process as

$$y_i = x_i'\beta + \sigma\eta_i = x_i'\beta + \left(\Phi(x_i'\beta/\sigma) - 1\right)x_i'\beta + \sigma\phi(x_i'\beta/\sigma) + \sigma\omega_i, \quad E[\omega_i] = 0.$$

So in regressing $y_i$ on $x_i$ we neglect additional regressors, and, as the omitted
regressors are correlated with $x_i$, this produces a systematic bias in OLS
(see Section 3.2.3 (p. 142-3) on omitted variables bias). As the regressors $x_i$
are not orthogonal to the error term of the regression equation, OLS is not
consistent.
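The mean in (6.34) can be compared with a simulated censored sample. The following is a minimal sketch, assuming standard normal errors and hypothetical values of $x_i'\beta$ and $\sigma$; the function names `censored_mean` and `simulated_mean` are illustrative and not part of the text.

```python
import random
from statistics import NormalDist, fmean

N01 = NormalDist()  # standard normal distribution (assumed for the errors)


def censored_mean(xb: float, sigma: float) -> float:
    """E[y] from (6.34): Phi(x'b/s) * x'b + s * phi(x'b/s)."""
    z = xb / sigma
    return N01.cdf(z) * xb + sigma * N01.pdf(z)


def simulated_mean(xb: float, sigma: float, n: int = 200_000, seed: int = 0) -> float:
    """Sample mean of y = max(0, x'b + s * eps) over n simulated draws."""
    rng = random.Random(seed)
    return fmean(max(0.0, xb + sigma * rng.gauss(0.0, 1.0)) for _ in range(n))


if __name__ == "__main__":
    sigma = 1.0  # hypothetical scale
    for xb in (-2.0, -0.5, 0.0, 0.5, 2.0):  # hypothetical values of x'beta
        formula = censored_mean(xb, sigma)
        print(f"x'b = {xb:5.1f}  E[y] = {formula:.4f}  "
              f"simulated = {simulated_mean(xb, sigma):.4f}  "
              f"bias E[y] - x'b = {formula - xb:.4f}")
```

The last column is the term $(\Phi_i - 1)x_i'\beta + \sigma\phi_i$ that is neglected when $y_i$ is regressed on $x_i$ alone. It varies with $x_i'\beta$, which is why the omitted term is correlated with the regressors.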


Marginal  effects  in  the  tobit  model 

The marginal effects of the explanatory variables in the tobit model can be
split into two parts. If $y_i = 0$ and $x_i'\beta$ increases, then the probability that
$y_i > 0$ increases, that is, the probability of a positive response increases.
Second, if $y_i > 0$, then the mean response will increase. More formally, it
follows from (6.33) that

$$\frac{\partial E[y_i]}{\partial x_i} = \frac{\partial P[y_i > 0]}{\partial x_i}\, E[y_i \mid y_i > 0] + P[y_i > 0]\, \frac{\partial E[y_i \mid y_i > 0]}{\partial x_i}. \tag{6.35}$$

For the case of the standard normal distribution, the first term is
$\phi_i (z_i + \lambda_i)\beta$ and the second term is $\Phi_i (1 - z_i\lambda_i - \lambda_i^2)\beta$, where $z_i = x_i'\beta/\sigma$
(see Exercise 6.4). Substituting these results in (6.35) gives for the tobit
model

$$\frac{\partial E[y_i]}{\partial x_i} = \Phi(x_i'\beta/\sigma)\,\beta.$$

So in this case the marginal effects are not $\beta$ but smaller, with reduction factor
$0 < \Phi_i < 1$. The difference is small for large values of $x_i'\beta/\sigma$, as in this case
$\Phi_i \approx 1$, but the difference is large for small values of $x_i'\beta/\sigma$, as then $\Phi_i \approx 0$.
This is also intuitively clear. The condition for an observation $y_i > 0$ is that
$\varepsilon_i > -(x_i'\beta/\sigma)$. If $(x_i'\beta/\sigma)$ takes a large positive value, then this is hardly a
restriction, so that $y_i = y_i^* = x_i'\beta + \sigma\varepsilon_i$ in most cases. If we increase $x_i$ in such a
situation, then the marginal effect on $y_i$ will (in most cases) be $\beta$. On the other
hand, if $(x_i'\beta/\sigma)$ takes a large negative value, then the condition $\varepsilon_i > -(x_i'\beta/\sigma)$
will not often be satisfied, so that in most cases $y_i = 0$. A marginal increase in
$x_i$ will have no effect in most cases, as $y_i = 0$ still has a large probability.
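The result $\partial E[y_i]/\partial x_i = \Phi(x_i'\beta/\sigma)\beta$ can be verified numerically for a single regressor. The sketch below assumes standard normal errors, a scalar $x_i$, and hypothetical parameter values; it compares the closed form with the sum of the two terms of (6.35) and with a finite-difference derivative of (6.34). All function names are illustrative.

```python
from statistics import NormalDist

N01 = NormalDist()  # standard normal distribution (assumed error distribution)


def censored_mean(xb: float, sigma: float) -> float:
    """E[y] for normal errors, see (6.34): Phi * x'beta + sigma * phi."""
    z = xb / sigma
    return N01.cdf(z) * xb + sigma * N01.pdf(z)


def marginal_effect(x: float, beta: float, sigma: float) -> float:
    """Closed form dE[y]/dx = Phi(x * beta / sigma) * beta for a scalar regressor."""
    return N01.cdf(x * beta / sigma) * beta


def two_part_terms(x: float, beta: float, sigma: float) -> tuple:
    """The two terms of (6.35) for normal errors."""
    z = x * beta / sigma
    lam = N01.pdf(z) / N01.cdf(z)  # inverse Mills ratio
    first = N01.pdf(z) * (z + lam) * beta  # change in P[y > 0] times E[y | y > 0]
    second = N01.cdf(z) * (1.0 - z * lam - lam ** 2) * beta  # P[y > 0] times change in E[y | y > 0]
    return first, second


def numerical_effect(x: float, beta: float, sigma: float, h: float = 1e-5) -> float:
    """Central finite difference of E[y] with respect to x."""
    up = censored_mean((x + h) * beta, sigma)
    down = censored_mean((x - h) * beta, sigma)
    return (up - down) / (2.0 * h)


if __name__ == "__main__":
    beta, sigma = 0.8, 1.3  # hypothetical parameter values
    for x in (-2.0, 0.0, 1.5):
        first, second = two_part_terms(x, beta, sigma)
        print(f"x = {x:4.1f}  Phi*beta = {marginal_effect(x, beta, sigma):.6f}  "
              f"two terms = {first + second:.6f}  "
              f"finite difference = {numerical_effect(x, beta, sigma):.6f}")
```

The three columns should agree up to numerical error, and each lies between 0 and $\beta$, in line with the reduction factor $0 < \Phi_i < 1$.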


Estimation  by  maximum  likelihood 

The parameters of the tobit model can be estimated consistently by maximum
likelihood. Assuming that the observations are mutually independent, the
log-likelihood $\log(L) = \sum_i \log(p(y_i))$ is obtained from (6.32), so that

$$\log(L(\beta, \sigma)) = \sum_{\{i;\, y_i = 0\}} \log\left(1 - F(x_i'\beta/\sigma)\right) + \sum_{\{i;\, y_i > 0\}} \left( -\frac{1}{2}\log(\sigma^2) + \log f\left(\frac{y_i - x_i'\beta}{\sigma}\right) \right). \tag{6.36}$$

If we substitute $f = \phi$ and $F = \Phi$ of the standard normal distribution, then this
becomes

$$\log(L) = \sum_{\{i;\, y_i = 0\}} \log\left(1 - \Phi(x_i'\beta/\sigma)\right) + \sum_{\{i;\, y_i > 0\}} \left( -\frac{1}{2}\log(\sigma^2) - \frac{1}{2}\log(2\pi) - \frac{1}{2\sigma^2}(y_i - x_i'\beta)^2 \right).$$

The term for the observations $y_i > 0$ is as usual, and the first term corresponds to
the contribution of the observations $y_i = 0$. Note that this term differs from the
truncated sample correction term in (6.28). The tobit estimates are obtained by
maximizing this log-likelihood, for instance, by Newton-Raphson. The tobit
estimators have the usual properties of ML.


Remark on the censored observations ($y_i = 0$)

The censored data are mixed continuous-discrete. For the continuous data
$y_i > 0$, the regression model $y_i = x_i'\beta + \sigma\varepsilon_i$ applies. However, $\beta$ should not be
estimated by regressing $y_i$ on $x_i$ on the subsample of observations with $y_i > 0$,
for two reasons. First, the observations with $y_i = 0$ contain relevant information
on the parameters $\beta$ and $\sigma$, as is clear from the contribution of these
observations in the log-likelihood (6.36). Second, in the subsample of observations
with $y_i > 0$ the error terms do not have zero mean as they come from
a truncated distribution. The results in Section 6.3.1 show that OLS on the
truncated sample is not consistent.
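The log-likelihood (6.36) with normal errors is simple to evaluate directly. The following is a minimal sketch rather than a full estimation routine: the function name `tobit_loglik` and the tiny data set are hypothetical, and the maximization step (for instance Newton-Raphson) is left out.

```python
import math
from statistics import NormalDist

N01 = NormalDist()  # standard normal distribution


def tobit_loglik(beta, sigma, xs, ys):
    """Log-likelihood (6.36) for normal errors and left censoring at zero.

    beta: list of coefficients; xs: list of regressor rows (same length as beta);
    ys: censored outcomes, with y = 0 for the censored observations.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    total = 0.0
    for x, y in zip(xs, ys):
        xb = sum(b * v for b, v in zip(beta, x))
        if y > 0:
            # Usual normal regression contribution of an uncensored observation.
            total += (-0.5 * math.log(sigma ** 2) - 0.5 * math.log(2.0 * math.pi)
                      - (y - xb) ** 2 / (2.0 * sigma ** 2))
        else:
            # Contribution log(1 - Phi(x'b/s)), written as log(Phi(-x'b/s)) by symmetry.
            total += math.log(N01.cdf(-xb / sigma))
    return total


if __name__ == "__main__":
    # Hypothetical data: a constant and one regressor, two censored observations.
    xs = [[1.0, -1.0], [1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
    ys = [0.0, 0.0, 1.2, 2.5]
    print(tobit_loglik([0.2, 1.0], 1.0, xs, ys))
```

The `else` branch is the part that OLS on the subsample with $y_i > 0$ throws away: the observations with $y_i = 0$ still carry information on $\beta$ and $\sigma$ through $1 - \Phi(x_i'\beta/\sigma)$.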


The  Heckman  two-step  estimation  method 

An alternative estimation method is based on the idea that censored data
can be seen as a combination of a binary response (with possible outcomes
$y_i = 0$ and $y_i > 0$), followed by a linear relation $y_i = x_i'\beta + \sigma\varepsilon_i$ on the
truncated sample of observations with $y_i > 0$. For the tobit model, the binary
response model is a probit model with $P[y_i > 0] = \Phi(x_i'\beta/\sigma)$ and
$P[y_i = 0] = 1 - P[y_i > 0]$. Define the parameter vector $\gamma$ by $\gamma = (1/\sigma)\beta$,
and let $\tilde{y}_i = 1$ if $y_i > 0$ and $\tilde{y}_i = 0$ if $y_i = 0$. Then, as a first step, $\gamma$ can be
estimated consistently by ML in the probit model $P[\tilde{y}_i = 1] = \Phi(x_i'\gamma)$. As a
second step, consider the truncated sample of observations with $y_i > 0$. The
expected value of $y_i$ over this truncated sample, that is, $E[y_i \mid y_i^* > 0]$, is
given by (6.28). If we use the notation $y_i^+$ for the random variable $y_i$,
conditional on the information that $y_i > 0$, this can be written as

$$y_i^+ = x_i'\beta + \sigma\lambda_i + \omega_i, \qquad \lambda_i = \frac{\phi(x_i'\beta/\sigma)}{\Phi(x_i'\beta/\sigma)} = \frac{\phi(x_i'\gamma)}{\Phi(x_i'\gamma)}, \qquad E[\omega_i \mid y_i > 0] = 0.$$

The unobserved regressor $\lambda_i$ is replaced by the consistent estimator obtained
by substituting the probit estimate $\hat{\gamma}$ of $\gamma$, so that $\hat{\lambda}_i = \phi(x_i'\hat{\gamma})/\Phi(x_i'\hat{\gamma})$. Then
OLS in the above equation for $y_i^+$ on the truncated sample (with $y_i > 0$) gives
consistent estimators of the parameters $\beta$ and $\sigma$. This is called the Heckman
two-step method, which can be summarized as follows.


Heckman two-step estimation method

•  Step 1: Estimate the bias correction term by probit. Let $\tilde{y}_i$ be the binary
variable with $\tilde{y}_i = 1$ if $y_i > 0$ and $\tilde{y}_i = 0$ if $y_i = 0$. Estimate $\gamma$ by ML in
the probit model $P[\tilde{y}_i = 1] = \Phi(x_i'\gamma)$. Estimate the bias correction term $\lambda_i$ by
$\hat{\lambda}_i = \phi(x_i'\hat{\gamma})/\Phi(x_i'\hat{\gamma})$.

•  Step 2: Perform OLS in model with the estimated bias term as additional
regressor. Estimate $\beta$ and $\sigma$ by applying OLS in the model $y_i = x_i'\beta + \sigma\hat{\lambda}_i + \omega_i$,
using only the observations in the truncated sample with $y_i > 0$.

This two-step estimation method is relatively simple as compared to
ML. However, this method is not efficient, both because the two separate
steps neglect the parameter restrictions $\gamma = (1/\sigma)\beta$ and because the error
terms $\omega_i$ in the second step are non-normal and heteroskedastic. The
Heckman two-step method is useful, however, to obtain consistent initial
estimates for ML.
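The second step rests on the fact that, on the truncated sample, the mean of $y_i$ equals $x_i'\beta + \sigma\lambda_i$. The sketch below illustrates only this bias correction term, not the full two-step estimator (the probit and OLS fits are omitted). It assumes standard normal errors and hypothetical values of $x_i'\beta$ and $\sigma$; the function names are illustrative.

```python
import random
from statistics import NormalDist, fmean

N01 = NormalDist()  # standard normal distribution


def inverse_mills(c: float) -> float:
    """Inverse Mills ratio lambda(c) = phi(c) / Phi(c)."""
    return N01.pdf(c) / N01.cdf(c)


def truncated_mean(xb: float, sigma: float) -> float:
    """E[y | y > 0] = x'beta + sigma * lambda(x'beta / sigma)."""
    return xb + sigma * inverse_mills(xb / sigma)


def simulated_truncated_mean(xb: float, sigma: float, n: int = 200_000, seed: int = 0) -> float:
    """Average of the positive outcomes among n draws of x'beta + sigma * eps."""
    rng = random.Random(seed)
    draws = (xb + sigma * rng.gauss(0.0, 1.0) for _ in range(n))
    return fmean(y for y in draws if y > 0)


if __name__ == "__main__":
    xb, sigma = 0.5, 2.0  # hypothetical values of x'beta and sigma
    print(f"lambda = {inverse_mills(xb / sigma):.4f}")
    print(f"x'beta + sigma * lambda = {truncated_mean(xb, sigma):.4f}")
    print(f"simulated mean of y given y > 0 = {simulated_truncated_mean(xb, sigma):.4f}")
```

In an actual application $\lambda_i$ is not known and is replaced by $\hat{\lambda}_i$ from the probit step, which is one reason why the second-step standard errors need special care.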


Diagnostic  tests 

The reliability of censored regressions depends crucially on the underlying
model assumptions. The ML tobit estimators become inconsistent in case of
omitted variables, heteroskedasticity, or wrong specification of the distribution
of the error terms. A simple specification check is to compare the probit
estimate of $\gamma$ (in the first step of the Heckman method) with the estimated
values of $(1/\sigma)\beta$ obtained by ML in the tobit model. If the outcomes are
largely different, this indicates that the decision whether to be active or not
may be driven by other factors than the magnitude of the response $y_i$ (given
that $y_i > 0$). In the tobit model, both the decision to respond and the
magnitude of the response are modelled in terms of $x_i'\beta$. In the next section
we consider models where the decision process and the magnitude of the
response are modelled in different ways.


Example 6.7: Direct Marketing for Financial Product (continued)

We return to Example 6.6 of the foregoing section on a new financial
product. In Example 6.6 we considered only the customers of the bank
who decided to invest in the financial product. However, we also know the
individual characteristics of the customers who decided not to invest. We
will, therefore, construct a tobit model for the invested amount of money. We
will discuss (i) the data, (ii) the ML estimates of the tobit model, (iii) a
comparison with the results obtained in Example 6.6 for the truncated
sample, (iv) the estimates obtained by the Heckman two-step method, and
(v) a diagnostic check on the empirical validity of the tobit model.

(i)  The  data 

The data set consists of 925 individuals, of whom 470 responded by making
an investment in the product and 455 did not respond. For individuals who
responded, the amount of money invested in this product is known. The
explanatory variables (gender, activity, age) are known for all 925 individuals,
hence also for the individuals that did not invest in the product. So the
dependent variable is censored, not truncated. As before, we take as dependent
variable $y_i = \log(1 + \text{invest})$, where ‘invest’ is the amount of money
invested. For individuals who did not invest (so that ‘invest’ is zero), we
get $y_i = 0$.

(ii)  ML  estimates  of  the  tobit  model 

The tobit estimates (ML in the censored regression model) are in Panel 2 of
Exhibit 6.11. For comparison this table also contains the OLS estimates that
are obtained if the censoring is erroneously neglected (see Panel 1). The two
sets of estimates can be compared by the implied marginal effects, that is,
the estimates themselves in the OLS model and the average multipliers
in the tobit model obtained by averaging $b\,\Phi(x_i'b/s)$ over the sample. The
tobit multipliers in Panel 2 of Exhibit 6.11 are somewhat larger than the OLS
multipliers. The variables ‘gender’ and ‘activity’ have a positive effect on
the amount of money invested, and age has a parabolic effect, with a
maximum at an age of around 53 years (namely, where
$0.196 - 2 \cdot 0.185 \cdot (\text{age}/100) = 0$).
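The arithmetic behind the age of maximum investment and the average multipliers can be retraced with the numbers reported in Exhibit 6.11. The snippet below is a small illustrative calculation; the coefficient values are copied from Panel 2 and from the histogram statistics in part (c), while the function and variable names are hypothetical.

```python
# Tobit ML coefficients as reported in Panel 2 of Exhibit 6.11.
TOBIT_COEF = {
    "GENDER": 2.126287,
    "ACTIVITY": 1.691490,
    "AGE": 0.195933,
    "AGE^2/100": -0.184648,
}
MEAN_PHI = 0.696056  # sample mean of Phi(x'b/s) reported in Exhibit 6.11 (c)


def age_of_maximum(coef_age: float, coef_age_sq_over_100: float) -> float:
    """Solve coef_age + 2 * coef_age_sq_over_100 * (age / 100) = 0 for age."""
    return -100.0 * coef_age / (2.0 * coef_age_sq_over_100)


def average_multipliers(coefs: dict, mean_phi: float) -> dict:
    """Average marginal effects: coefficient times the sample mean of Phi(x'b/s)."""
    return {name: b * mean_phi for name, b in coefs.items()}


if __name__ == "__main__":
    age = age_of_maximum(TOBIT_COEF["AGE"], TOBIT_COEF["AGE^2/100"])
    print(f"age at which the fitted age effect peaks: {age:.1f} years")
    for name, value in average_multipliers(TOBIT_COEF, MEAN_PHI).items():
        print(f"{name:10s} average multiplier = {value:.4f}")
```

With the unrounded coefficients the turning point is again close to 53 years, and the products are close to the average multipliers listed for these variables in Panel 2.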

(iii)  Comparison  of  tobit  estimates  with  results  for  truncated  sample 

We  compare  the  results  of  the  tobit  model  in  Panel  2  of  Exhibit  6.11  with  the 
results  for  the  truncated  sample  obtained  in  Example  6.6  (see  Panel  2  of 
Exhibit  6.9).  The  effect  of  ‘activity’  now  has  the  expected  positive  sign 
(instead  of  negative)  and  the  maximum  investments  are  around  an  age  of 
53  (instead  of  62).  Further,  the  tobit  estimates  indicate  higher  investments  by 
males  as  compared  to  females,  whereas  the  reverse  effect  was  estimated  in  the 
truncated  sample.  As  the  information  on  individuals  who  do  not  invest  is  of 
importance  in  describing  the  general  investment  behaviour,  the  results 
obtained  for  the  censored  sample  are  more  reliable  than  the  ones  for  the 
truncated  sample.  This  illustrates  the  general  point  that  it  is  always  advisable 
to  include  relevant  information  in  the  model.  The  truncated  model  of 
Example  6.6  neglects  the  information  on  non-investing  customers,  and  this 
makes  this  model  much  less  informative  than  the  tobit  model  for  the 
censored  data. 

(iv)  Heckman  two-step  estimates 

Panels 4 and 6 of Exhibit 6.11 show the estimates obtained by the Heckman
two-step method. This gives much larger standard errors and less significant
results than ML. Because the error terms $\omega_i$ in the second-step regression are
heteroskedastic, the standard errors in Panel 6 are computed by the method
of White (see Section 5.4.2 (p. 324-5)). The estimated bias correction terms
$\lambda_i$ obtained in step 1 of the Heckman method are much larger than the ones
estimated in Example 6.6 (see the histograms in Exhibits 6.9 (d) and
6.11 (e)). The minimum value of $\lambda_i$ is now 0.41 (whereas in the truncated
model all values are around 0.00). We conclude that the bias terms are
underestimated in the truncated sample. As the values of $\lambda_i$ for the censored
data are quite large, this implies that OLS on the truncated sample is
seriously biased. Hence also the truncated ML estimates in Panel 2 of Exhibit
6.9 are biased, as these estimates are nearly the same as OLS for these data.
For instance, the effects of the variables ‘gender’ and ‘activity’ in the
(consistent) second step of the Heckman method in Panel 6 of Exhibit 6.11 are
both positive, whereas these effects are negative if the bias correction term is
neglected (see Panels 1 and 2 of Exhibit 6.9 for the truncated sample).


Panel 1: Dependent Variable: LOGINVEST = LOG(1 + INVEST)
Method: Least Squares
Included observations: 925

Variable      Coefficient   Std. Error   t-Statistic   Prob.
C               -1.110803     0.956974     -1.160745   0.2460
GENDER           0.967838     0.174920      5.533017   0.0000
ACTIVITY         0.874305     0.198665      4.400908   0.0000
AGE              0.103027     0.038399      2.683045   0.0074
AGE^2/100       -0.095076     0.036670     -2.592709   0.0097

R-squared             0.069073
S.E. of regression    2.346782

Panel 2: Dependent Variable: LOGINVEST = LOG(1 + INVEST)
Method: ML - Censored Normal (TOBIT), left censoring at 0
Included observations: 925
Convergence achieved after 7 iterations

Variable      Coefficient   Std. Error   z-Statistic   Prob.    Average multiplier
C               -5.936450     1.975280     -3.005371   0.0027            -4.096151
GENDER           2.126287     0.360378      5.900151   0.0000             1.479896
ACTIVITY         1.691490     0.373128      4.533269   0.0000             1.177277
AGE              0.195933     0.079056      2.478413   0.0132             0.136369
AGE^2/100       -0.184648     0.075808     -2.435718   0.0149            -0.128515

Error Distribution
SCALE: SIGMA     4.159631     0.155950     26.67282    0.0000

R-squared             0.060935
S.E. of regression    2.358299
Left censored obs     455
Right censored obs    0
Uncensored obs        470
Total obs             925

(c) Histogram of the series PHI (Sample 1 1000, Observations 925):
Mean 0.696056, Median 0.859516, Maximum 0.998962, Minimum 3.11E-05

Exhibit 6.11 Direct Marketing for Financial Product (Example 6.7)

Models for invested amount of money based on data of 925 individuals (470 made an
investment, 455 did not invest), OLS (Panel 1) and Tobit model (censored regression,
Panel 2). (c) shows the histogram of the values of $\Phi(x_i'b_{ML}/s_{ML})$ (with sample
mean 0.696; the average multipliers reported in Panel 2 for the Tobit model are
obtained by multiplying the estimated coefficients by this factor).


(d)

Panel 4: Dependent Variable: RESPONSE
(1 = does invest, 0 = does not invest)
Method: ML - Binary Probit
Included observations: 925
Convergence achieved after 5 iterations

Variable      Coefficient   Std. Error   z-Statistic   Prob.
C               -1.497584     0.536822     -2.789720   0.0053
GENDER           0.588114     0.096684      6.082811   0.0000
ACTIVITY         0.561167     0.111572      5.029656   0.0000
AGE              0.041680     0.021544      1.934636   0.0530
AGE^2/100       -0.040982     0.020607     -1.988730   0.0467

Panel 6: Dependent Variable: LOGINVEST = LOG(1 + INVEST)
Method: Least Squares
Sample(adjusted): 1 500 IF INVEST>0; Included observations: 470
White Heteroskedasticity-Consistent Standard Errors & Covariance

Variable      Coefficient   Std. Error   t-Statistic   Prob.
C               -0.628395     5.993881     -0.104839   0.9165
GENDER           0.535657     1.317266      0.406643   0.6845
ACTIVITY         0.489967     1.065880      0.459683   0.6460
AGE              0.123239     0.093687      1.315437   0.1890
AGE^2/100       -0.108606     0.092892     -1.169162   0.2429
LAMBDA           1.951267     3.380174      0.577268   0.5640

R-squared             0.048679
S.E. of regression    0.944855

Exhibit 6.11 (Contd.)

Heckman two-step method, probit model for investment decision (step 1, Panel 4),
histogram of the corresponding values of $\lambda$ (inverse Mills ratio (e)), and OLS on
the truncated sample with $\lambda$ as additional regressor (step 2, Panel 6).

(v)  A  diagnostic  check  on  the  tobit  model 

Finally, as a diagnostic check we compare the estimates $\hat{\gamma}$ of the
probit model in the first step of the Heckman method (Panel 4) with
the tobit ML estimates $(1/s)b$ (obtained from Panel 2). Dividing the
values of $b$ by $s = 4.160$ gives (after rounding to three digits) the values
$(1/s)b = (-1.427, 0.511, 0.407, 0.047, -0.044)'$. This does not differ
much from $\hat{\gamma} = (-1.498, 0.588, 0.561, 0.042, -0.041)'$. So there is no
indication that the factors that determine the decision whether or not to
invest would be any different from the factors that determine the amount of
invested money. This supports the use of the tobit model.
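The comparison in this diagnostic check is a simple rescaling of the reported estimates. As a small illustration, the following snippet divides the tobit coefficients of Panel 2 by the estimated scale and prints them next to the probit estimates of Panel 4; the numbers are copied from Exhibit 6.11 and the names are hypothetical.

```python
NAMES = ["C", "GENDER", "ACTIVITY", "AGE", "AGE^2/100"]
TOBIT_B = [-5.936450, 2.126287, 1.691490, 0.195933, -0.184648]  # Panel 2 coefficients
TOBIT_S = 4.159631  # SCALE: SIGMA in Panel 2
PROBIT_GAMMA = [-1.497584, 0.588114, 0.561167, 0.041680, -0.040982]  # Panel 4 coefficients


def scaled_tobit(b: list, s: float) -> list:
    """Tobit-implied probit coefficients (1/s) * b."""
    return [bj / s for bj in b]


if __name__ == "__main__":
    for name, implied, probit in zip(NAMES, scaled_tobit(TOBIT_B, TOBIT_S), PROBIT_GAMMA):
        print(f"{name:10s} (1/s)b = {implied:7.3f}   probit = {probit:7.3f}")
```

This is only an informal comparison of point estimates; it does not by itself provide a formal test of the tobit restrictions.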

Exercises:  T:  6.4c,  6.5;  S:  6.9c-e;  E:  6.16. 


6.3.3  Models  for  selection  and  treatment  effects 

A  model  for  selection 

In truncated samples, the values of the dependent variable are observed only
in a certain interval ($y_i > 0$ in the standard model). More generally, let $z_i$ be a
selection dummy that takes the value $z_i = 1$ if the $i$th individual is in the
sample and $z_i = 0$ if the individual is not in the sample. We assume that
$y_i = x_i'\beta + \sigma\varepsilon_i$ applies for all individuals (observed and unobserved), and that
this model satisfies all the standard assumptions. Then the observed sample
can be described by

$$y_i = x_i'\beta + \sigma\varepsilon_i \text{ if } z_i = 1, \qquad y_i \text{ is not observed if } z_i = 0. \tag{6.37}$$

OLS, that is, regressing $y_i$ on $x_i$ for the observations with $z_i = 1$, is consistent
if and only if the selection is exogenous. This condition will be violated if the
selection variable $z_i$ depends on the error term $\varepsilon_i$. This is the case, for
instance, in truncated regressions where $z_i = 1$ if and only if $y_i > 0$, since in
this case $z_i = 1$ if and only if $\varepsilon_i > -(x_i'\beta/\sigma)$. In general, OLS on selected
samples is inconsistent if the selection process is endogenous in the sense
that the selection dummy $z_i$ depends on the error term $\varepsilon_i$.


The  tobit  (type  2)  model  for  selection  effects 

In truncated regression models, an individual is unobserved if the index
function $y_i^* = x_i'\beta + \sigma\varepsilon_i$ takes negative values. That is, the factor $x_i'\beta$ that
influences the probability of being observed is the same as the factor
that influences the magnitude of the response $y_i = y_i^*$ for $y_i^* > 0$. In some
cases these factors may be different. For instance, the decision to work or
not may be based on considerations other than the number of hours worked,
and the decision to buy a durable product or not may be influenced by factors
other than the amount of money spent by the buyers of this product.

Let $w_i$ be a set of variables that influences the chance that $y_i$ is observed
($z_i = 1$) or not ($z_i = 0$). A possible selection model is as follows.

$$z_i = 1 \text{ if } w_i'\gamma + \omega_i > 0, \qquad z_i = 0 \text{ if } w_i'\gamma + \omega_i \le 0. \tag{6.38}$$

The combined model (6.37) and (6.38) is called the tobit type 2 model. It
differs in the following respects from the standard (or tobit type 1) model
(6.31) of Section 6.3.2. First, in the tobit type 1 model the dependent variable
is censored (with $y_i = 0$ for $z_i = 0$ and $y_i > 0$ for $z_i = 1$), whereas in the tobit
type 2 model $y_i$ is not observed for $z_i = 0$ and $y_i$ can take both negative and
positive values if $z_i = 1$. Second, the selection variables $w_i$ are (partly)
different from the regressors $x_i$, whereas in the tobit type 1 model
$w_i = x_i$, $\gamma = \beta$, and $\omega_i = \sigma\varepsilon_i$.

It is assumed that the data set consists of $n$ observations of the variables
$(x_i, w_i, z_i)$, whereas the dependent variable $y_i$ is observed only for the
observations with $z_i = 1$. For instance, we may have data of $n$ individuals of whom
some have a job ($z_i = 1$) and others not ($z_i = 0$). If the dependent variable of
interest $y_i$ is the wage that an individual with characteristics $x_i$ would
normally earn, then $y_i$ is not observed for the individuals without a job. Relevant
characteristics $x_i$ that may affect wage are, for instance, age and education,
and factors $w_i$ that may affect the chance that an individual works are, for
instance, age, education, and family composition. As another example, $y_i$ may
be the price of the new car bought by customer $i$ during an action period. A
relevant explanatory variable $x_i$ may be the price of the current car of the
customer, and $w_i$ may be the age of the current car and the marketing effort for
this customer. The sales revenue $y_i$ is observed only for the customers who
decide to buy a new car ($z_i = 1$), whereas the characteristics $(x_i, w_i)$ are
known for all customers.


Distinction  between  truncated  and  censored  selection 

Until now we have assumed that the dependent variable in the tobit type 2
model is truncated in the sense that $y_i$ is not observed if $z_i = 0$. Sometimes
one assigns instead the value $y_i = 0$ if $z_i = 0$, so that the dependent variable
becomes censored instead of truncated. For instance, the wage of non-working
people is zero, and the amount of money spent by non-buying
customers is zero. In estimation it does not matter which convention one
follows as, conditional on $z_i = 0$, the fact that $y_i = 0$ is a matter of definition
that provides no additional information. However, the truncated sample
interpretation is often more natural, since in this case $y_i$ can be seen as the
natural response that corresponds to $x_i$. For individuals with $y_i = 0$, this
response is due not so much to $x_i$, but to $w_i$ that causes $z_i = 0$. For instance,
for non-working individuals the wage is ‘zero’ because they do not work
($z_i = 0$), and it is better to say that we do not observe the wage that would
normally be earned by individuals with the same characteristics $x_i$.


Derivation  of  selection  bias  of  OLS 

The regression of $y_i$ on $x_i$ in the observed sample (with $z_i = 1$) provides consistent
estimates if the error terms $\omega_i$ in the selection equation are independent from the
error terms $\varepsilon_i$ in the regression model. Otherwise OLS is inconsistent. To
investigate this in more detail, we assume that the values of $(w_i, x_i)$ are fixed and that the
error terms $(\omega_i, \varepsilon_i)$ are independent for different observations, with joint normal
distribution with mean zero, variances $E[\omega_i^2] = 1$ and $E[\varepsilon_i^2] = 1$, and covariance
$E[\omega_i\varepsilon_i] = \rho$. In this case

$$\begin{pmatrix} \omega_i \\ \varepsilon_i \end{pmatrix} \sim \mathrm{NID}\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right).$$

As was discussed in Section 6.1.1, the variance of the error term $\omega_i$ should be fixed,
as otherwise the parameters $\gamma$ of the selection equation are not identified. The
variance of $\varepsilon_i$ should also be fixed, because of the term $\sigma$ in the model (6.37). Let
$\eta_i = \varepsilon_i - \rho\omega_i$, then $\eta_i$ is normally distributed with mean zero, and, since $E[\eta_i\omega_i] = 0$,
it follows that $\eta_i$ and $\omega_i$ are mutually independent. Writing $\varepsilon_i = \rho\omega_i + \eta_i$, it
therefore follows that $E[\varepsilon_i \mid z_i = 1] = E[\varepsilon_i \mid \omega_i > -w_i'\gamma] = \rho E[\omega_i \mid \omega_i > -w_i'\gamma]$. According to
(6.27), the last term can be written as $\rho\lambda_i$, where $\lambda_i = \phi(w_i'\gamma)/\Phi(w_i'\gamma)$. This shows
that for observations in the sample (with $z_i = 1$) there holds

$$E[\varepsilon_i \mid z_i = 1] = \rho E[\omega_i \mid \omega_i > -w_i'\gamma] = \rho\lambda_i.$$

Also note that in the observed sample (with $z_i = 1$) $x_i'\beta$ is not equal to the mean
of $y_i$, as

$$E[y_i \mid z_i = 1] = x_i'\beta + \sigma E[\varepsilon_i \mid z_i = 1] = x_i'\beta + \rho\sigma\lambda_i. \tag{6.39}$$

Therefore OLS in $y_i = x_i'\beta + \sigma\varepsilon_i$ is inconsistent, as this neglects the regressor $\lambda_i$,
unless $\rho = 0$, that is, unless the selection variable $z_i$ is independent of the error
term $\varepsilon_i$.
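The selection bias $E[\varepsilon_i \mid z_i = 1] = \rho\lambda_i$ can be illustrated by simulation. The sketch below assumes the bivariate normal error structure described above, with hypothetical values for $w_i'\gamma$ and $\rho$; it constructs $\varepsilon_i = \rho\omega_i + \eta_i$ exactly as in the derivation. The function names are illustrative.

```python
import math
import random
from statistics import NormalDist, fmean

N01 = NormalDist()  # standard normal distribution


def selection_bias(w_gamma: float, rho: float) -> float:
    """E[eps | z = 1] = rho * lambda, with lambda = phi(w'gamma) / Phi(w'gamma)."""
    return rho * N01.pdf(w_gamma) / N01.cdf(w_gamma)


def simulated_selected_error_mean(w_gamma: float, rho: float,
                                  n: int = 200_000, seed: int = 0) -> float:
    """Average of eps over the draws that are selected (w'gamma + omega > 0)."""
    rng = random.Random(seed)
    selected = []
    for _ in range(n):
        omega = rng.gauss(0.0, 1.0)
        # eta is independent of omega with variance 1 - rho^2, so Var(eps) = 1.
        eta = rng.gauss(0.0, math.sqrt(1.0 - rho ** 2))
        eps = rho * omega + eta
        if w_gamma + omega > 0:
            selected.append(eps)
    return fmean(selected)


if __name__ == "__main__":
    w_gamma = 0.3  # hypothetical value of w'gamma
    for rho in (0.0, 0.6):  # hypothetical correlations
        print(f"rho = {rho:.1f}  rho * lambda = {selection_bias(w_gamma, rho):.4f}  "
              f"simulated = {simulated_selected_error_mean(w_gamma, rho):.4f}")
```

For $\rho = 0$ the selected errors have mean close to zero, so that selection does not distort the regression; for $\rho \neq 0$ the mean shifts by $\rho\lambda_i$.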


Derivation  of  log-likelihood  in  case  of  sample  selection 

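The parameters $(\beta, \gamma, \sigma, \rho)$ can be estimated consistently by ML. The likelihood
function is equal to the joint probability distribution of the dependent variables $z_i$
(for $i = 1, \ldots, n$) and $y_i$ (for $z_i = 1$). If the observations are assumed to be
mutually independent, we get $L = \prod_{\{i;\, z_i = 0\}} p(z_i) \prod_{\{i;\, z_i = 1\}} p(y_i, z_i = 1)$, and as
$p(y_i, z_i = 1) = p(y_i)P[z_i = 1 \mid y_i]$ it follows that

$$\log(L(\beta, \gamma, \sigma, \rho)) = \sum_{\{i;\, z_i = 0\}} \log(p(z_i)) + \sum_{\{i;\, z_i = 1\}} \log(P[z_i = 1 \mid y_i]) + \sum_{\{i;\, z_i = 1\}} \log(p(y_i)).$$

The last term in this expression stands for the contribution of the observed
values $y_i \sim N(x_i'\beta, \sigma^2)$. The first term in the log-likelihood can be evaluated
by using the fact that $P[z_i = 0] = P[\omega_i \le -w_i'\gamma] = \Phi(-w_i'\gamma) = 1 - \Phi(w_i'\gamma)$.
For the second term in the log-likelihood we use that $P[z_i = 1 \mid y_i] = P[\omega_i > -w_i'\gamma \mid y_i]$,
where $y_i = x_i'\beta + \sigma\varepsilon_i$, so that $(\omega_i, y_i)$ follows the bivariate normal
distribution

$$\begin{pmatrix} \omega_i \\ y_i \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ x_i'\beta \end{pmatrix}, \begin{pmatrix} 1 & \rho\sigma \\ \rho\sigma & \sigma^2 \end{pmatrix} \right).$$

It follows from (1.22) that $\omega_i \mid y_i \sim N\left(\frac{\rho}{\sigma}(y_i - x_i'\beta),\, 1 - \rho^2\right)$, so that

$$P[z_i = 1 \mid y_i] = P[\omega_i > -w_i'\gamma \mid y_i] = P\left[ \frac{\omega_i - \frac{\rho}{\sigma}(y_i - x_i'\beta)}{\sqrt{1 - \rho^2}} > \frac{-w_i'\gamma - \frac{\rho}{\sigma}(y_i - x_i'\beta)}{\sqrt{1 - \rho^2}} \right] = \Phi\left( \frac{w_i'\gamma + \frac{\rho}{\sigma}(y_i - x_i'\beta)}{\sqrt{1 - \rho^2}} \right). \tag{6.40}$$

Because of these results, the log-likelihood of the model with selection effects can
be expressed as follows.

$$\log(L(\beta, \gamma, \sigma, \rho)) = \sum_{\{i;\, z_i = 0\}} \log\left(1 - \Phi(w_i'\gamma)\right) + \sum_{\{i;\, z_i = 1\}} \log \Phi\left( \frac{w_i'\gamma + \frac{\rho}{\sigma}(y_i - x_i'\beta)}{\sqrt{1 - \rho^2}} \right) + \sum_{\{i;\, z_i = 1\}} \left( -\frac{1}{2}\log(\sigma^2) - \frac{1}{2}\log(2\pi) - \frac{1}{2\sigma^2}(y_i - x_i'\beta)^2 \right).$$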
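The three parts of this log-likelihood translate directly into code. The following is a minimal sketch for evaluating it at given parameter values; the function name `selection_loglik`, the tuple layout of the observations, and the example data are hypothetical, and no optimizer is included.

```python
import math
from statistics import NormalDist

N01 = NormalDist()  # standard normal distribution


def dot(a, b):
    """Inner product of two equally long sequences."""
    return sum(u * v for u, v in zip(a, b))


def selection_loglik(obs, beta, gamma, sigma, rho):
    """Log-likelihood of the selection model (6.37)-(6.38) with normal errors.

    obs: iterable of tuples (x, w, z, y); y is ignored when z == 0.
    """
    if sigma <= 0 or not -1.0 < rho < 1.0:
        raise ValueError("need sigma > 0 and -1 < rho < 1")
    total = 0.0
    for x, w, z, y in obs:
        wg = dot(w, gamma)
        if z == 0:
            # log(1 - Phi(w'gamma)), written as log(Phi(-w'gamma)) by symmetry.
            total += math.log(N01.cdf(-wg))
        else:
            resid = y - dot(x, beta)
            # log P[z = 1 | y], see (6.40).
            arg = (wg + (rho / sigma) * resid) / math.sqrt(1.0 - rho ** 2)
            total += math.log(N01.cdf(arg))
            # log density of the observed y ~ N(x'beta, sigma^2).
            total += (-0.5 * math.log(sigma ** 2) - 0.5 * math.log(2.0 * math.pi)
                      - resid ** 2 / (2.0 * sigma ** 2))
    return total


if __name__ == "__main__":
    # Hypothetical observations: (x, w, z, y), with a constant in x and in w.
    data = [([1.0, 0.5], [1.0, 0.2], 1, 1.4),
            ([1.0, -0.3], [1.0, -1.0], 0, None),
            ([1.0, 1.2], [1.0, 0.8], 1, 2.1)]
    print(selection_loglik(data, beta=[0.5, 1.0], gamma=[0.1, 0.9], sigma=1.0, rho=0.4))
```

For $\rho = 0$ the second term reduces to $\log\Phi(w_i'\gamma)$, so that the likelihood separates into a probit part for $z_i$ and a regression part for the observed $y_i$.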


Heckman  two-step  method 

Consistent estimates of $\beta$ can again also be obtained by means of a Heckman
two-step method. According to (6.39), for observed values of $y_i$ (that is, for
$z_i = 1$) the bias term is equal to $E[y_i \mid z_i = 1] - x_i'\beta = \rho\sigma\lambda_i$. Let
$\eta_i = y_i - E[y_i \mid z_i = 1]$; then we can write

$$y_i = x_i'\beta + \rho\sigma\lambda_i + \eta_i, \qquad E[\eta_i] = 0 \quad (\text{for } z_i = 1).$$

The two-step method is similar to the one described in Section 6.3.2 for
censored data. In the first step we use all $n$ observations $(w_i, z_i)$ to estimate
the parameters $\gamma$ of the probit selection model (6.38). Let $\hat{\gamma}$ be the obtained
estimates; then the inverse Mills ratios are estimated by $\hat{\lambda}_i = \phi(w_i'\hat{\gamma})/\Phi(w_i'\hat{\gamma})$.
In the second step consistent estimates of $\beta$ and $\rho\sigma$ are obtained by regressing
$y_i$ on $x_i$ and $\hat{\lambda}_i$, using only the subsample of observations with $z_i = 1$. A test
on the significance of selection bias (that occurs only if $\rho \ne 0$) can be
performed by testing whether the coefficient of the inverse Mills ratio $\hat{\lambda}_i$ is
significant. Since the error terms $\eta_i$ are heteroskedastic, the conventional OLS
formulas for the standard errors are not valid. Consistent standard errors can
be obtained by White’s method (see Section 5.4.2 (p. 324-5)).

Remark on the explanatory variables $w_i$ and $x_i$

It may well be that some of the variables $w_i$ that affect the selection variable
$z_i$ are also relevant in explaining the response $y_i$. For instance, someone’s age
may influence the decision whether to work or not and, for someone who is
working, it may also affect the wage level. To avoid excessively large
correlations between the regressors $x_i$ and $\lambda_i$, one usually requires that $w_i$ contains
at least one variable that is not present in $x_i$.

Model  for  treatment  effects 

The above selection model can also be used for the analysis of treatment
effects. Consider the model

$$y_i = x_i'\beta + \alpha z_i + \sigma\varepsilon_i, \qquad \varepsilon_i \sim \mathrm{NID}(0, 1),$$

where $z_i$ is a dummy variable with the value 0 (no treatment) or 1 (treatment).
It is assumed that the treatment selection can be described by (6.38). For
instance, $y_i$ may be the amount of money spent in a store and the treatment $z_i$
may indicate whether the customer owns a credit card for the store or not.
The coefficient $\alpha$ is the treatment effect, that is, the additional purchases
that a customer makes because he or she owns a credit card for the store.


Derivation  of  treatment  bias  of  OLS 

If the treatment effect is estimated by regressing $y_i$ on $x_i$ and $z_i$,
then this gives inconsistent estimators if the error term $\omega_i$ in the
treatment selection is correlated with the error term $\varepsilon_i$ in the
regression model. To analyse this in more detail, let
$\phi_i = \phi(w_i'\gamma)$, $\Phi_i = \Phi(w_i'\gamma)$, and
$\lambda_i = \phi_i/\Phi_i$; then under the same assumptions as before we have
$E[\varepsilon_i | z_i = 1] = \rho\lambda_i$. Further,
$E[\varepsilon_i | z_i = 0]$ can be obtained from the fact that

$$
\begin{aligned}
0 = E[\varepsilon_i] &= P[z_i = 1]\,E[\varepsilon_i | z_i = 1] + P[z_i = 0]\,E[\varepsilon_i | z_i = 0] \\
&= \Phi_i\rho\lambda_i + (1 - \Phi_i)\,E[\varepsilon_i | z_i = 0],
\end{aligned}
$$

so that $E[\varepsilon_i | z_i = 0] = -\rho\lambda_i\Phi_i/(1 - \Phi_i)$.
Combining the results for $z_i = 1$ and $z_i = 0$, we can write

$$
E[\varepsilon_i | z_i] = z_i(\rho\lambda_i) + (1 - z_i)\left(-\frac{\rho\lambda_i\Phi_i}{1 - \Phi_i}\right)
= \rho\lambda_i\,\frac{z_i(1 - \Phi_i) - (1 - z_i)\Phi_i}{1 - \Phi_i}
= \rho\,\frac{\lambda_i(z_i - \Phi_i)}{1 - \Phi_i}.
$$



For given $x_i$ and $z_i$, let $\eta_i = \varepsilon_i - E[\varepsilon_i | z_i]$.
Then $E[\eta_i | z_i] = 0$ and $\varepsilon_i = \eta_i + E[\varepsilon_i | z_i]$,
so that the model with treatment effects can be written as

$$
\begin{aligned}
y_i &= x_i'\beta + \alpha z_i + \sigma E[\varepsilon_i | z_i] + \sigma\eta_i \\
&= x_i'\beta + \alpha z_i + \rho\sigma\,\frac{\lambda_i(z_i - \Phi_i)}{1 - \Phi_i} + \sigma\eta_i,
\qquad E[\eta_i | z_i] = 0.
\end{aligned}
$$

The additional term $\sigma E[\varepsilon_i | z_i]$ is the treatment bias term.
Regression of $y_i$ on $x_i$ and $z_i$ is inconsistent because the omitted
regressor is correlated with the treatment variable $z_i$ (note that $\phi_i$
and $\Phi_i$ also depend on $z_i$). OLS is consistent only if $\rho = 0$, that
is, if the random effects $\omega_i$ in the treatment selection are
independent of the random effects $\varepsilon_i$ in the outcome of $y_i$.
For instance, if individuals with higher than average expenditure
($\varepsilon_i > 0$) also have a larger than average chance of owning the
store’s credit card ($\omega_i > 0$) so that $\rho > 0$, then OLS will
overestimate the treatment effect. This is because the OLS estimate of
$\alpha$ will incorporate part of the effect of the omitted bias term that has
a coefficient $\rho\sigma > 0$ in this case.
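
The conditional means used in this derivation can be checked by simulation.
The sketch below is only an illustration and rests on assumptions that are
not spelled out in this passage: the selection rule is taken to be
$z_i = 1$ if $w_i'\gamma + \omega_i \ge 0$ (our reading of (6.38)), the pair
$(\varepsilon_i, \omega_i)$ is drawn as standard bivariate normal with
correlation $\rho$, and the values `index = 0.3` and `rho = 0.6` are
hypothetical.

```python
import random
from statistics import NormalDist, fmean

N01 = NormalDist()  # standard normal: pdf is phi, cdf is Phi


def bias_term(z, index, rho):
    """E[eps | z] = rho * lambda * (z - Phi) / (1 - Phi), with index = w'gamma."""
    Phi = N01.cdf(index)
    lam = N01.pdf(index) / Phi
    return rho * lam * (z - Phi) / (1.0 - Phi)


def simulate_conditional_means(index, rho, n=200_000, seed=0):
    """Monte Carlo estimates of E[eps | z = 1] and E[eps | z = 0]."""
    rng = random.Random(seed)
    treated, untreated = [], []
    for _ in range(n):
        omega = rng.gauss(0.0, 1.0)
        # eps has unit variance and correlation rho with omega
        eps = rho * omega + (1.0 - rho**2) ** 0.5 * rng.gauss(0.0, 1.0)
        # assumed selection rule: treatment if index + omega >= 0
        (treated if index + omega >= 0.0 else untreated).append(eps)
    return fmean(treated), fmean(untreated)


index, rho = 0.3, 0.6  # hypothetical values
sim_treated, sim_untreated = simulate_conditional_means(index, rho)
print("z = 1:", sim_treated, "formula:", bias_term(1, index, rho))
print("z = 0:", sim_untreated, "formula:", bias_term(0, index, rho))
```

With $\rho > 0$ the simulated mean of $\varepsilon_i$ is positive for treated
and negative for untreated observations, which is the source of the upward
bias of OLS discussed above.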


Estimation  of  treatment  effects  by  ML  and  by  the  Heckman 
two-step  method 

As before, consistent estimates can be obtained by ML. Again, we assume that
the values of $(w_i, x_i)$ are fixed, or, equivalently, we use the likelihood
function conditional on the observed values of $(w_i, x_i)$. The likelihood
function is equal to the joint probability distribution of the dependent
variables $z_i$ and $y_i$, $i = 1, \ldots, n$. If the observations are assumed
to be mutually independent, we get $L = \prod_{i=1}^{n} p(y_i)P(z_i | y_i)$ and
hence

$$\log(L(\alpha, \beta, \gamma, \sigma, \rho)) = \sum_{i=1}^{n} \log(p(y_i)) + \sum_{i=1}^{n} \log(P(z_i | y_i)).$$

As the selection model is again given by (6.38), the log-likelihood can be
evaluated in the same way as discussed above for the model with selection
effects. By using (6.40) for $P[z_i = 1 | y_i]$ and the fact that
$P[z_i = 0 | y_i] = 1 - P[z_i = 1 | y_i]$, we get

$$
\begin{aligned}
\log(L(\alpha, \beta, \gamma, \sigma, \rho)) ={}& \sum_{\{i;\, z_i = 1\}} \log \Phi\!\left(\frac{w_i'\gamma + \frac{\rho}{\sigma}(y_i - x_i'\beta - \alpha)}{\sqrt{1 - \rho^2}}\right) \\
&+ \sum_{\{i;\, z_i = 0\}} \log\!\left(1 - \Phi\!\left(\frac{w_i'\gamma + \frac{\rho}{\sigma}(y_i - x_i'\beta)}{\sqrt{1 - \rho^2}}\right)\right) \\
&+ \sum_{i=1}^{n} \left(-\tfrac{1}{2}\log(\sigma^2) - \tfrac{1}{2}\log(2\pi) - \frac{1}{2\sigma^2}(y_i - x_i'\beta - \alpha z_i)^2\right).
\end{aligned}
$$

Maximum likelihood estimates can be computed by maximizing this
log-likelihood. A simpler way to get consistent estimates of the treatment
effect $\alpha$ is again to apply a Heckman two-step method. As before, in the
first step we use the observations $(w_i, z_i)$ to estimate the parameters
$\gamma$ of the probit selection model (6.38). This gives consistent estimates
of the bias term $\lambda_i(z_i - \Phi_i)/(1 - \Phi_i)$. In the second step
$y_i$ is regressed on the variables $x_i$, $z_i$, and the estimated bias
terms.
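
The two steps can be sketched in code. The following is a minimal
illustration on simulated data, not the procedure of any particular software
package. It assumes the selection rule $z_i = 1$ if
$w_i'\gamma + \omega_i \ge 0$, takes $x_i$ to be a constant only, and uses
hypothetical parameter values; the helper names (`solve`, `ols`, `probit`,
`bias_terms`) are ours. The probit step uses Fisher scoring, which is one of
several ways to maximize the probit likelihood.

```python
import random
from statistics import NormalDist

N01 = NormalDist()  # standard normal: pdf is phi, cdf is Phi


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def ols(rows, y):
    """OLS coefficients from the normal equations X'X b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


def probit(rows, z, iterations=20):
    """Probit ML estimate of gamma by Fisher scoring, started at zero."""
    k = len(rows[0])
    g = [0.0] * k
    for _ in range(iterations):
        score = [0.0] * k
        info = [[0.0] * k for _ in range(k)]
        for r, zi in zip(rows, z):
            idx = sum(ri * gi for ri, gi in zip(r, g))
            p = min(max(N01.cdf(idx), 1e-10), 1.0 - 1e-10)
            d = N01.pdf(idx)
            for i in range(k):
                score[i] += d * (zi - p) / (p * (1.0 - p)) * r[i]
                for j in range(k):
                    info[i][j] += d * d / (p * (1.0 - p)) * r[i] * r[j]
        g = [gi + si for gi, si in zip(g, solve(info, score))]
    return g


def bias_terms(rows, z, g):
    """lambda_i (z_i - Phi_i) / (1 - Phi_i), evaluated at the estimate g."""
    out = []
    for r, zi in zip(rows, z):
        idx = sum(ri * gi for ri, gi in zip(r, g))
        Phi = N01.cdf(idx)
        out.append(N01.pdf(idx) / Phi * (zi - Phi) / (1.0 - Phi))
    return out


# Simulated data; every parameter value below is hypothetical.
rng = random.Random(1)
alpha, beta0, sigma, rho, gamma = 1.0, 2.0, 1.0, 0.7, [0.2, 1.0]
W, z, y = [], [], []
for _ in range(3000):
    w = [1.0, rng.gauss(0.0, 1.0)]
    omega = rng.gauss(0.0, 1.0)
    eps = rho * omega + (1.0 - rho**2) ** 0.5 * rng.gauss(0.0, 1.0)
    zi = 1.0 if gamma[0] + gamma[1] * w[1] + omega >= 0.0 else 0.0
    W.append(w)
    z.append(zi)
    y.append(beta0 + alpha * zi + sigma * eps)

naive = ols([[1.0, zi] for zi in z], y)   # OLS without bias correction
g_hat = probit(W, z)                      # step 1: probit for the selection
corr = bias_terms(W, z, g_hat)
two_step = ols([[1.0, zi, c] for zi, c in zip(z, corr)], y)  # step 2
print("naive alpha:", naive[1], "two-step alpha:", two_step[1])
```

In this design $\rho > 0$, so the uncorrected OLS estimate of $\alpha$ is
pushed upwards, whereas the second-step coefficient of $z_i$ should lie close
to the value of $\alpha$ used in the simulation. The coefficient of the bias
term estimates $\rho\sigma$.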


The  overall  difference  between  treated  and  non-treated  subjects 

The above analysis shows that, for an individual with characteristics $x_i$,
the overall difference of the response $y_i$ between treated and untreated
individuals will in general not be equal to $\alpha$, unless the treatments
are applied randomly over the sample (so that $\rho = 0$). The expression for
the overall difference in response is

$$
E[y_i | z_i = 1] - E[y_i | z_i = 0]
= \alpha + \rho\sigma\left(\frac{\lambda_i(1 - \Phi_i)}{1 - \Phi_i} - \frac{\lambda_i(0 - \Phi_i)}{1 - \Phi_i}\right)
= \alpha + \frac{\rho\sigma\lambda_i}{1 - \Phi_i}. \qquad (6.41)
$$

Here $\alpha$ is the actual treatment effect. If $\alpha = 0$, so that
treatment has no actual effect, then there is still a difference between
treated and untreated individuals if $\rho \neq 0$. For instance, if
$\rho > 0$, then treated individuals already have a tendency for higher
responses, since $E[\varepsilon_i | z_i = 1] = \rho\lambda_i > 0$ and
$E[\varepsilon_i | z_i = 0] < 0$ in this case. If we neglect this bias and
apply OLS of $y_i$ on $x_i$ and $z_i$, then in this case (with $\alpha = 0$)
we might in general find misleading significant values for $\alpha$, which
would wrongly suggest that treatment matters.
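
As a small numerical illustration of (6.41), the function below evaluates the
overall difference for given parameter values. The function name and the
numbers are hypothetical; `index` stands for $w_i'\gamma$.

```python
from statistics import NormalDist

N01 = NormalDist()  # standard normal: pdf is phi, cdf is Phi


def overall_difference(alpha, rho, sigma, index):
    """E[y | z = 1] - E[y | z = 0] = alpha + rho*sigma*lambda / (1 - Phi)."""
    Phi = N01.cdf(index)
    lam = N01.pdf(index) / Phi
    return alpha + rho * sigma * lam / (1.0 - Phi)


# No actual treatment effect (alpha = 0), but positive correlation rho:
print(overall_difference(alpha=0.0, rho=0.5, sigma=1.0, index=0.0))
# With rho = 0 the difference reduces to the treatment effect itself:
print(overall_difference(alpha=0.3, rho=0.0, sigma=1.0, index=0.0))
```

In the first call the difference is positive although $\alpha = 0$, which is
the spurious treatment effect described above.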
Example 6.8: Student Learning (continued)

In  this  example  we  will  analyse  the  (treatment)  effects  of  additional  calculus 
courses  on  the  grades  that  students  obtain  in  intermediate  micro-  and  macro¬ 
economics.  For  this  purpose  we  consider  data  on  student  learning  that 
were analysed by J. S. Butler, T. A. Finegan, and J. J. Siegfried in their
paper ‘Does More Calculus Improve Student Learning in Intermediate
Micro- and Macroeconomic Theory?’, Journal of Applied Econometrics,
13/2 (1998), 185-202. Part of these student learning data was used as a
leading  example  in  Chapter  1  (see  in  particular  Example  1.1,  where  this  data 
set  was  introduced). 

We  will  discuss  (i)  the  data  and  the  model  for  the  grades,  (ii)  a  selection 
model  for  the  attained  level  of  calculus,  (iii)  the  results  for  grades  in  micro¬ 
economics,  and  (iv)  the  results  for  grades  in  macroeconomics. 

(i)  Data  and  model  for  the  grades 

We are interested in the question whether more calculus improves student
learning in intermediate micro- and macroeconomic theory. The data consist of
the results of 609 students in intermediate microeconomics and of 490
students in intermediate macroeconomics at Vanderbilt University. The
dependent variable $y_i$ is the obtained grade (in intermediate
microeconomics or macroeconomics). These grades range from 0 to 4, and the
sample mean is 2.65. The explanatory variable of interest ($z_i$) is the
level of calculus attained by the student prior to following the intermediate
economic theory course. This variable has the interpretation of a treatment
variable. We distinguish two levels of calculus, ordinary (3 or 4 credit
hours, denoted by $z_i = 0$) and high (6 to 12 credit hours, denoted by
$z_i = 1$). The effect of the level of calculus $z_i$ on the grades $y_i$ is
modelled by the linear relation

$$y_i = x_i'\beta + \alpha z_i + \sigma\varepsilon_i, \qquad \varepsilon_i \sim \mathrm{NID}(0, 1).$$

The explanatory variables $(z_i, x_i)$ are listed in Exhibit 6.12. The
treatment variable $z_i$ is denoted by ‘mathhigh’, and the grade deflator is
used to compensate for possible differences between the different instructors
who graded the exams in intermediate economic theory.

(ii)  Selection  model  for  level  of  calculus 

Because of possible selection bias, a direct regression of the grade $y_i$ in
the intermediate theory course on the explanatory variables $(x_i, z_i)$ may
give an inconsistent estimate of the (treatment) effect $\alpha$ of additional
courses in calculus. This is because similar aptitudes and interests may lead
students to enrol and do well in both mathematics and economics. The level of
calculus chosen by the student is explained by means of the probit model
(6.38), with $\omega_i \sim N(0, 1)$ and with the explanatory variables $w_i$
listed in Exhibit 6.12 (b).

(a) $x_i$ and $z_i$: explanatory variables for grade in economics

Variable       Description
C              Constant term
MATHHIGH       ‘Treatment’, z_i = 0 if 3 or 4 credit hours, z_i = 1 if 6 to 12 credit hours
GRADELOW       Grade in last calculus course if 3 or 4 credit hours
GRADEHIGH      Grade in last calculus course if 6 to 12 credit hours
GRDFINT        Grade deflator of instructors in intermediate theory course
GRMACRO1       Grade in introductory macroeconomics
GRMICRO1       Grade in introductory microeconomics
FRESHMAN       Freshman grade point average
FEMALE         Gender dummy, 1 for females and 0 for males
SATMATH/100    SAT mathematics score, divided by 100
SATVERB/100    SAT verbal score, divided by 100

(b) $w_i$: explanatory variables for level of calculus

Variable       Description
C              Constant term
SATMATH/100    SAT mathematics score, divided by 100
FEMALE         Gender dummy, 1 for females and 0 for males
MAJORESH       1 if expected major in economics, social science, or humanity, 0 otherwise
MAJORNAT       1 if expected major in natural science, 0 otherwise
ADVMATH1       1 if 1 year of high school advanced maths, 0 otherwise
ADVMATH2       1 if 2 years of high school advanced maths, 0 otherwise
ADVMATH3       1 if > 2 years of high school advanced maths, 0 otherwise
PHYSICS        1 if physics in high school, 0 otherwise
CHEMISTRY      1 if chemistry in high school, 0 otherwise

Exhibit 6.12 Student Learning (Example 6.8)

Explanatory variables $(x_i, z_i)$ for obtained grade in economics (a) and
explanatory variables $w_i$ for attained level in calculus (b). The variable
MATHHIGH is the treatment variable $z_i$.

(iii)  Results  for  grades  in  microeconomics 

The parameters $(\alpha, \beta, \gamma, \sigma, \rho)$ of the joint model are
estimated by the Heckman two-step method. In the first step we estimate
$\gamma$ by a probit model for $z_i$ in terms of the explanatory variables
$w_i$. The results are used to estimate the bias correction terms
$\lambda_i(z_i - \Phi_i)/(1 - \Phi_i)$. In the second step, the grade $y_i$ is
regressed on the explanatory variables $(x_i, z_i)$ and the estimated bias
correction terms as additional regressor. The standard errors in the
second-step regression are computed by the method of White (see Section 5.4.2
(p. 324-5)), because the error terms in this regression are heteroskedastic.
The results for microeconomics are given in Exhibit 6.13 (a-c).

(a) Panel 1: Dependent Variable: MATHHIGH
Method: ML - Binary Probit
Included observations: 609 (MATHHIGH = 0(1) for 224(385) observations)
Convergence achieved after 4 iterations

Variable       Coefficient   Std. Error   z-Statistic   Prob.
C              -3.233952     0.694970     -4.653367     0.0000
SATMATH/100     0.443273     0.099780      4.442485     0.0000
FEMALE          0.158684     0.116379      1.363505     0.1727
MAJORESH       -0.214143     0.137400     -1.558537     0.1191
MAJORNAT        0.386246     0.178288      2.166418     0.0303
ADVMATH1        0.173933     0.248937      0.698701     0.4847
ADVMATH2        0.878933     0.253796      3.463152     0.0005
ADVMATH3        0.691171     0.621522      1.112063     0.2661
PHYSICS         0.326966     0.118983      2.748005     0.0060
CHEMISTRY       0.139247     0.220568      0.631311     0.5278

(b) [Histogram of the estimated bias terms SELCORMICRO; graph not reproduced]

(c) Panel 3: Dependent Variable: GRINTERMICRO
Method: Least Squares; Included observations: 609
White Heteroskedasticity-Consistent Standard Errors & Covariance

Variable       Coefficient   Std. Error   t-Statistic   Prob.
C              -1.313984     0.386882     -3.396346     0.0007
SELCORMICRO     0.022413     0.037716      0.594255     0.5526
MATHHIGH        0.987359     0.215678      4.577921     0.0000
GRADELOW        0.292158     0.065798      4.440232     0.0000
GRADEHIGH       0.060555     0.051076      1.185590     0.2363
GRDFINTMICRO    0.839083     0.102980      8.148048     0.0000
GRMACRO1        0.176453     0.052557      3.357358     0.0008
GRMICRO1        0.290380     0.046338      6.266522     0.0000
FRESHMAN        0.324305     0.101163      3.205755     0.0014
FEMALE          0.082313     0.059692      1.378972     0.1684
SATMATH/100     0.088795     0.054408      1.631999     0.1032
SATVERB/100     0.055464     0.041474      1.337304     0.1816

Exhibit 6.13 Student Learning (Example 6.8)

Heckman two-step estimate of the effect of additional courses in calculus
(MATHHIGH, 0/1) on grades in intermediate microeconomics, probit model for
level of calculus (step 1, Panel 1) with histogram of estimated bias terms
(denoted by SELCORMICRO, i.e. $\lambda_i(z_i - \Phi_i)/(1 - \Phi_i)$ in (b))
and OLS in model with this bias correction term included (step 2, Panel 3).

(d) Panel 4: Dependent Variable: MATHHIGH
Method: ML - Binary Probit
Included observations: 490 (MATHHIGH = 0(1) for 167(323) observations)
Convergence achieved after 4 iterations

Variable       Coefficient   Std. Error   z-Statistic   Prob.
C              -2.534435     0.786598     -3.222019     0.0013
SATMATH/100     0.382440     0.110370      3.465072     0.0005
FEMALE          0.095552     0.133248      0.717101     0.4733
MAJORESH       -0.116509     0.154365     -0.754767     0.4504
MAJORNAT        0.325960     0.201677      1.616247     0.1060
ADVMATH1       -0.050933     0.269884     -0.188723     0.8503
ADVMATH2        0.721165     0.276831      2.605076     0.0092
ADVMATH3        0.336540     0.661922      0.508429     0.6112
PHYSICS         0.330657     0.133081      2.484635     0.0130
CHEMISTRY       0.061545     0.284738      0.216147     0.8289

(e) [Histogram of the estimated bias terms SELCORMACRO; graph not reproduced]

(f) Panel 6: Dependent Variable: GRINTERMACRO
Method: Least Squares; Included observations: 490
White Heteroskedasticity-Consistent Standard Errors & Covariance

Variable       Coefficient   Std. Error   t-Statistic   Prob.
C              -0.086512     0.366548     -0.236019     0.8135
SELCORMACRO     0.034281     0.043518      0.787743     0.4312
MATHHIGH        0.021387     0.070944      0.301467     0.7632
GRDFINTMACRO    0.919443     0.161693      5.686344     0.0000
GRMACRO1        0.206147     0.054263      3.798994     0.0002
GRMICRO1        0.307535     0.050872      6.045269     0.0000
FRESHMAN        0.564736     0.089483      6.311126     0.0000
FEMALE          0.028948     0.063046      0.459166     0.6463
SATMATH/100     0.006629     0.054836      0.120892     0.9038
SATVERB/100    -0.022063     0.047101     -0.468414     0.6397

Exhibit 6.13 (Contd.)

Heckman two-step estimate of the effect of additional courses in calculus
(MATHHIGH, 0/1) on grades in intermediate macroeconomics, probit model for
level of calculus (step 1, Panel 4) with histogram of estimated bias terms
(denoted by SELCORMACRO, i.e. $\lambda_i(z_i - \Phi_i)/(1 - \Phi_i)$ in (e))
and OLS in model with this bias correction term included (step 2, Panel 6).


The explanatory variables $w_i$ in the probit model have the expected signs.
The second-step regression in Panel 3 indicates that the selection effects
(the bias correction term denoted by ‘selcormicro’ in Panel 3) are not
significant (P-value 0.55). Further, a higher level of calculus has an
estimated payoff of 0.99 (the coefficient of the treatment variable denoted
by ‘mathhigh’ in Panel 3). This means a whole letter grade (the grades in the
sample range from 0 to 4 with sample mean 2.65). That is, additional calculus
gives significantly better results in intermediate microeconomics. The other
explanatory variables have the expected signs.

(iv)  Results  for  grades  in  macroeconomics 

The last three panels in Exhibit 6.13 (d-f) show the results for intermediate
macroeconomics. The results of the probit model are comparable to those for
microeconomics. The second-step regression in Panel 6 indicates that
selection effects (the bias correction term denoted by ‘selcormacro’ in
Panel 6) are again not significant (P-value 0.43). Further, a higher level of
calculus has no payoff (the coefficient of ‘mathhigh’ in Panel 6 is 0.02,
with P-value 0.76). That is, additional calculus does not affect the results
in intermediate macroeconomics. The other explanatory variables have the
expected signs (the variables ‘gradehigh’ and ‘gradelow’ are omitted because
of missing data for this group of students). In the paper of Butler, Finegan,
and Siegfried the above questions are studied in more detail, with more
explanatory variables and with a finer distinction of the attained level of
calculus (with seven ordered levels instead of the above two levels). A
further analysis of the data is left as an exercise (see Exercise 6.15).

Exercises:  T:  6.4d;  E:  6.15b. 


6.3.4  Duration  models 

Duration  data 

A  duration  measures  the  amount  of  time  that  elapses  before  a  certain  event 
takes  place,  or  the  amount  of  time  that  has  passed  since  a  certain  event  took 
place.  Examples  are  the  time  it  takes  for  an  unemployed  person  to  find  a  job  (or 
the  time  of  unemployment  if  a  job  has  not  yet  been  found),  the  time  that  elapses 
between  two  purchases  of  the  same  product,  and  the  length  of  a  strike.  Exhibit 
6.14  (a)  shows  the  duration  (measured  in  days)  of  62  finished  strikes  —  that  is, 
the  number  of  days  between  the  start  and  the  end  of  strikes  (these  data  will  be 
further  described  and  analysed  in  Example  6.9  (p.  516)).  The  mean  duration  is 
43  days,  the  median  duration  27  days,  and  the  durations  are  positively 
skewed.  Exhibit  6.14  (b)  shows  the  histogram  of  the  logarithmic  strike  dur¬ 
ations,  which  are  more  normally  distributed. 

Censoring  aspect  of  duration  data 

In many cases duration data are right censored, namely, at the time of
measurement the duration may not yet be finished. This is the case, for
instance, if the observed person is still unemployed, if the observed
customer has not bought the product again, or if the observed strike is still
going on. It follows from the results in Sections 6.3.1 and 6.3.2 that the
application of OLS to explain durations in terms of explanatory variables is
inconsistent, for two reasons. First, some of the durations may be censored
(if the event had not taken place at the time of measurement). Second, if the
sample is restricted to the durations that have finished (so that the event
has taken place before the time of measurement), then the effect of this
truncation should be taken into account.

(a), (b) [Histograms; graphs not reproduced]

Series: STRIKEDUR
Sample 1 62
Observations 62

Mean           42.67742
Median         27.00000
Maximum        216.0000
Minimum        1.000000
Std. Dev.      45.84070
Skewness       1.624063
Kurtosis       5.402414
Jarque-Bera    42.16496
Probability    0.000000

Exhibit 6.14 Duration data

Histogram of strike durations (measured in days (a)) and of logarithm of
strike durations (b).


The  hazard  rate 

In practice the main interest often lies in the question of how long the
duration will continue, given that it has not finished yet. The hazard rate
measures the chance that the duration will end now, given that it has not
ended before. In the above examples, this can be interpreted as the chance of
finding a job, of purchasing a product, or of a strike ending. Duration
models are expressed in terms of hazard rates, and the econometric question
is to estimate hazard rates from observed duration data. Let the data consist
of a sample of $n$ durations $y_1, \ldots, y_n$. It is assumed that these
durations consist of a random sample from a population with density function
$f$ and corresponding cumulative distribution function $F$. The survival
function $S(t)$ and the hazard rate $\lambda(t)$ are defined by

$$
S(t) = P[y_i > t] = 1 - F(t), \qquad
\lambda(t) = \lim_{\delta \downarrow 0} \frac{P[t \le y_i \le t + \delta \mid y_i \ge t]}{\delta}.
$$

Instead of estimating the density function $f$ of the durations, one usually
estimates the hazard rate $\lambda$, as this is of more practical interest.
The survival function and the density function can then be obtained as
follows from the hazard rate. Since

$$\lambda(t) = \frac{f(t)}{S(t)} = -\frac{d\log(S(t))}{dt},$$

it follows that

$$f(t) = \lambda(t)S(t), \qquad S(t) = e^{-\int_0^t \lambda(s)\,ds} \qquad (6.42)$$

(see Exercise 6.6).
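
The relations in (6.42) are easy to evaluate numerically. The sketch below
integrates a hazard rate with the midpoint rule to obtain the survival
function and the density; the function names and the two example hazard rates
are illustrative choices, not taken from the text.

```python
import math


def survival_from_hazard(hazard, t, steps=10_000):
    """S(t) = exp(-integral_0^t hazard(s) ds), using the midpoint rule."""
    h = t / steps
    integral = sum(hazard((k + 0.5) * h) for k in range(steps)) * h
    return math.exp(-integral)


def density_from_hazard(hazard, t):
    """f(t) = hazard(t) * S(t)."""
    return hazard(t) * survival_from_hazard(hazard, t)


# Constant hazard 0.5: S(t) should equal exp(-0.5 t).
print(survival_from_hazard(lambda s: 0.5, 2.0), math.exp(-1.0))
# Linearly increasing hazard 2s: S(t) should equal exp(-t^2).
print(survival_from_hazard(lambda s: 2.0 * s, 1.5), math.exp(-2.25))
```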

Models  for  the  hazard  rate 

Several  models  for  the  hazard  rate  can  be  formulated  according  to  whether 
the  hazard  is  constant  or  increases  or  decreases  over  time.  For  instance,  the 
probability  of  finding  a  job  or  purchasing  a  product  may  be  constant  over 
time,  but  it  may  also  increase  or  decrease  as  time  progresses.  The  model  with 
constant  hazard  rate 


$$\lambda(t) = \gamma \qquad (\text{for all } t)$$

corresponds to the density $f(t) = \gamma e^{-\gamma t}$, that is, the
exponential distribution. This is called the exponential hazard model.
Several models with positive or negative time dependence can be formulated.
For example, the Weibull distribution with density
$f(t) = \alpha\gamma t^{\alpha - 1} e^{-\gamma t^{\alpha}}$ corresponds to the
(Weibull) hazard model

$$\lambda(t) = \alpha\gamma t^{\alpha - 1} \qquad (6.43)$$

(see Exercise 6.6). In this model, the hazard rate increases over time if
$\alpha > 1$, it decreases if $\alpha < 1$, and it remains constant if
$\alpha = 1$. It may also be that the hazard rate first increases and later
decreases. This can be modelled, for example, by the log-normal distribution,
where the log-duration $\log(y_i)$ is normally distributed with mean $\mu$
and variance $\sigma^2$. The corresponding hazard rate is given by

$$
\lambda(t) = \frac{1}{\sigma t}\,
\frac{\phi\big((\log(t) - \mu)/\sigma\big)}{1 - \Phi\big((\log(t) - \mu)/\sigma\big)}
\qquad (6.44)
$$

(see Exercise 6.6). In this case the hazard rate first increases and later
decreases, with turning point given by the solution of the equation
$t\sigma\lambda(t) = \sigma + (\log(t) - \mu)/\sigma$ (see Exercise 6.6).
Exhibit 6.15 shows graphs of some of the above hazard rates.

[Exhibit 6.15, panels (a) to (d): graphs not reproduced]

Hazard rates, exponential with constant hazard rate
$\lambda(t) = \gamma = 1$ (a), Weibull with $\gamma = 1$ and
$\lambda(t) = \alpha t^{\alpha - 1}$ with $\alpha = 1.5$ ((b), increasing
hazard rate) and with $\alpha = 0.5$ ((c), decreasing hazard rate), and
hazard rate corresponding to the log-normal distribution with $\mu = 2$ and
$\sigma = 1$ (d).
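
The shapes of these hazard rates can be explored with a few lines of code.
The sketch below implements the Weibull hazard (6.43) and the log-normal
hazard, written as the ratio of the log-normal density to its survival
function, and locates the peak of the log-normal hazard on a grid for
$\mu = 2$ and $\sigma = 1$. The function names, the grid, and the grid search
are illustrative assumptions.

```python
import math
from statistics import NormalDist

N01 = NormalDist()  # standard normal: pdf is phi, cdf is Phi


def weibull_hazard(t, alpha, gamma):
    """Weibull hazard alpha * gamma * t^(alpha - 1); alpha = 1 is exponential."""
    return alpha * gamma * t ** (alpha - 1.0)


def lognormal_hazard(t, mu, sigma):
    """Hazard f(t)/S(t) when log(y) is normal with mean mu, std. dev. sigma."""
    u = (math.log(t) - mu) / sigma
    return N01.pdf(u) / (sigma * t * (1.0 - N01.cdf(u)))


mu, sigma = 2.0, 1.0
grid = [0.1 * k for k in range(1, 1001)]          # t = 0.1, 0.2, ..., 100
values = [lognormal_hazard(t, mu, sigma) for t in grid]
peak = max(range(len(grid)), key=values.__getitem__)
t_peak = grid[peak]

# At the turning point both sides of the turning-point equation should agree
# (up to the coarseness of the grid).
lhs = t_peak * sigma * values[peak]
rhs = sigma + (math.log(t_peak) - mu) / sigma
print("turning point near t =", t_peak, "lhs:", lhs, "rhs:", rhs)
```

The grid search is crude; a root finder applied to the turning-point equation
would give the turning point more precisely.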


Proportional  hazard  model 

The hazard rates may not be the same for all individuals and may depend on
individual characteristics. Let $\lambda_i(t)$ be the hazard rate that
applies to the $i$th duration $y_i$. We assume that the individual hazard
rates can be expressed as $\lambda_i(t) = g_i\lambda(t)$, where the factor
$g_i > 0$ stands for the individual-specific effects. If these effects are
modelled by $g_i = e^{x_i'\beta}$, where $x_i$ are observed variables that
affect the hazard rate, then

$$\lambda_i(t) = e^{x_i'\beta}\lambda(t).$$

This is called the proportional hazard model. As the baseline hazard rate
$\lambda(t)$ often contains a scale parameter, the constant term should be
excluded from $x_i$. If we take logarithms, we get

$$\log(\lambda_i(t)) = x_i'\beta + \log(\lambda(t)),$$

so that the log-hazard depends linearly on the explanatory variables. This
resembles the linear regression model somewhat, but a crucial difference is
that the log-hazard is not directly observed. The individual characteristics
$x_i$ are assumed to be constant over time. The parameters $\beta$ measure
the marginal relative effects of the explanatory variables on the hazard
rate, that is,

$$\beta = \frac{\partial\log(\lambda_i(t))}{\partial x_i} = \frac{1}{\lambda_i(t)}\,\frac{\partial\lambda_i(t)}{\partial x_i}.$$

The survival function of the proportional hazard model is given by
$S_i(t) = [S(t)]^{e^{x_i'\beta}}$, where $S(t)$ is the survival function of
the baseline hazard rate $\lambda(t)$. For larger values of $x_i'\beta$ the
hazard $\lambda_i(t)$ increases, and as $0 \le S(t) \le 1$ the survival
function $S_i(t)$ decreases in this case. That is, larger values of
$x_i'\beta$ correspond on average to shorter durations. If the baseline
hazard rate $\lambda(t)$ corresponds to the Weibull distribution in (6.43),
then the expected durations of the proportional hazard model are given by
$E[y_i] = e^{-x_i'\beta/\alpha}\mu_0$, where $\mu_0$ is the expected duration
of the baseline hazard with $x_i = 0$ (see Exercise 6.6).
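
The expected-duration formula for the Weibull case can be verified
numerically. The sketch below uses the standard result (not derived in this
passage) that a Weibull duration with survival function
$e^{-\gamma t^{\alpha}}$ has mean $\gamma^{-1/\alpha}\Gamma(1 + 1/\alpha)$;
the function names and the parameter values are hypothetical.

```python
import math


def weibull_mean(alpha, gamma):
    """Mean duration when S(t) = exp(-gamma * t^alpha)."""
    return gamma ** (-1.0 / alpha) * math.gamma(1.0 + 1.0 / alpha)


def ph_survival(t, xb, alpha, gamma):
    """S_i(t) = S(t)^(exp(x'beta)) for the Weibull baseline."""
    return math.exp(-gamma * t**alpha) ** math.exp(xb)


def ph_expected_duration(xb, alpha, gamma):
    """E[y_i] = exp(-x'beta / alpha) * mu_0, mu_0 the baseline mean."""
    return math.exp(-xb / alpha) * weibull_mean(alpha, gamma)


xb, alpha, gamma = 0.4, 1.5, 0.02   # hypothetical values
# Multiplying the hazard by exp(xb) is the same as replacing gamma by
# gamma * exp(xb), so both computations should agree.
print(ph_expected_duration(xb, alpha, gamma))
print(weibull_mean(alpha, gamma * math.exp(xb)))
# A larger x'beta lowers the survival function at any given t.
print(ph_survival(10.0, xb, alpha, gamma), ph_survival(10.0, 0.0, alpha, gamma))
```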


Estimation  of  the  hazard  rate  model  by  maximum  likelihood 

The parameters of a (proportional) hazard model can be estimated by maximum
likelihood. To derive the likelihood function, it should be realized that
some of the observed durations $y_1, \ldots, y_n$ may be finished (the person
found a job, or made a purchase, or the strike ended, indicated by
$z_i = 1$), but others may be censored (the person is still unemployed, or
has still not made a purchase, or the strike still continues, indicated by
$z_i = 0$), so that the finished duration will be larger than $y_i$. The
probability that the $i$th duration is still unfinished is given by
$P[z_i = 0] = S_i(y_i)$, and the density for finished durations is given by
$p_i(y_i) = \lambda_i(y_i)S_i(y_i)$. Assuming that the $n$ observations are
mutually independent, the log-likelihood is therefore given by

$$
\begin{aligned}
\log(L) &= \sum_{\{i;\, z_i = 1\}} \log(p_i(y_i)) + \sum_{\{i;\, z_i = 0\}} \log(S_i(y_i)) \\
&= \sum_{\{i;\, z_i = 1\}} \log(\lambda_i(y_i)) + \sum_{i=1}^{n} \log(S_i(y_i)) \\
&= \sum_{\{i;\, z_i = 1\}} \big(x_i'\beta + \log(\lambda(y_i))\big) - \sum_{i=1}^{n} \left(e^{x_i'\beta}\int_0^{y_i} \lambda(t)\,dt\right). \qquad (6.45)
\end{aligned}
$$

Here we used the fact that $S_i(t) = [S(t)]^{e^{x_i'\beta}}$, so that
$\log(S_i(y_i)) = e^{x_i'\beta}\log(S(y_i)) = -e^{x_i'\beta}\int_0^{y_i}\lambda(t)\,dt$.
The log-likelihood (6.45) can be maximized to obtain ML estimates of the
parameters $\beta$ and of the parameters of the baseline hazard rate
$\lambda(t)$. For instance, for a constant baseline hazard rate
$\lambda(t) = \gamma$, the log-likelihood becomes

$$\log(L(\beta, \gamma)) = \sum_{\{i;\, z_i = 1\}} \big(x_i'\beta + \log(\gamma)\big) - \sum_{i=1}^{n} \gamma e^{x_i'\beta} y_i.$$

The estimated model can be visualized by making plots of the estimated hazard
rate functions $\lambda_i(t)$ against the time variable $t$, for different
choices of the values of the explanatory variables $x_i$.
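
For the exponential model without explanatory variables the log-likelihood
reduces to $m\log(\gamma) - \gamma\sum_i y_i$, with $m$ the number of
finished durations, so that the ML estimate has the closed form
$\hat\gamma = m/\sum_i y_i$. The sketch below applies this to the 62 finished
strike durations, using only the number of observations and the mean reported
in Exhibit 6.14, and compares the outcome with Panel 5 of Exhibit 6.16. The
function names are ours, and the per-observation form of the Akaike criterion
is an assumption chosen because it reproduces the reported figure.

```python
import math


def exponential_mle(m_finished, total_time):
    """ML estimate of a constant hazard: finished spells / total observed time."""
    return m_finished / total_time


def exponential_loglik(gamma, m_finished, total_time):
    """log L = m log(gamma) - gamma * sum_i y_i (no explanatory variables)."""
    return m_finished * math.log(gamma) - gamma * total_time


n = 62                      # all strikes are finished, so m = n
mean_duration = 42.67742    # sample mean from Exhibit 6.14
total = n * mean_duration

gamma_hat = exponential_mle(n, total)
loglik = exponential_loglik(gamma_hat, n, total)
aic = (-2.0 * loglik + 2.0 * 1) / n   # one parameter; assumed definition

print("gamma:", round(gamma_hat, 6))       # Panel 5 reports 0.023432
print("log likelihood:", round(loglik, 4)) # Panel 5 reports -294.7275
print("Akaike:", round(aic, 6))            # Panel 5 reports 9.539598
```

With censored observations the same formula applies, with `m_finished`
counting only the finished durations and `total_time` summing all observed
durations, finished or not.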


Diagnostic  checks  on  hazard  rate  models 

To describe a test on the correct specification of the model, let $m \le n$
be the number of finished duration times in the sample. Suppose that the data
are ordered so that the first $m$ observations are finished and the remaining
$n - m$ observations are censored. Then the generalized residuals are defined
in terms of the survival function by

$$e_i = -\log(S_i(y_i)) = e^{x_i'\beta}\int_0^{y_i} \lambda(t)\,dt, \qquad i = 1, \ldots, m,$$

where the ML estimates are substituted for $\beta$ and for the parameters of
the baseline hazard rate $\lambda$. If the model is correctly specified, then
the random variable $S_i = S_i(y_i)$ has a uniform distribution on the unit
interval and $e_i$ follows the unit exponential distribution with density
$e^{-t}$ (see Exercise 6.17). Note that this result does not depend on the
functional form of the hazard rate.

If the model is correctly specified, then the sample cumulative distribution
function of the outcomes $S_i(y_i)$, $i = 1, \ldots, m$, should be close to
the 45° line. Alternatively, the sample distribution of the generalized
residuals may be compared with the unit exponential distribution. The
(uncentred) sample moments $\sum_{i=1}^{m} e_i^k/m$ can be compared with the
corresponding moments of the unit exponential distribution that has $k$th
population moment $k! = k \cdot (k - 1) \cdots 2 \cdot 1$ (see Exercise
6.17). If the sample contains no censored observations, that is, if $m = n$,
then ML gives $\sum_{i=1}^{n} e_i/n = 1$, so that the comparison should be
based on the second and higher order moments (see Exercise 6.17).
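
A minimal sketch of this moment comparison for the exponential model without
explanatory variables and without censoring is given below. The durations are
simulated (with a hypothetical hazard of 0.05), so the numbers serve only to
illustrate the mechanics; the function names are ours.

```python
import math
import random


def generalized_residuals_exponential(durations, gamma_hat):
    """e_i = -log(S(y_i)) = gamma_hat * y_i for a constant hazard."""
    return [gamma_hat * y for y in durations]


def sample_moment(residuals, k):
    """Uncentred k-th sample moment of the generalized residuals."""
    return sum(e**k for e in residuals) / len(residuals)


rng = random.Random(42)
durations = [rng.expovariate(0.05) for _ in range(5000)]  # simulated data
gamma_hat = len(durations) / sum(durations)               # ML, no censoring
residuals = generalized_residuals_exponential(durations, gamma_hat)

for k in (1, 2, 3):
    print(k, sample_moment(residuals, k), math.factorial(k))
```

The first sample moment equals 1 by construction of the ML estimate, so only
the second and higher moments carry information on possible misspecification.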


Example  6.9:  Duration  of  Strikes 

We consider data on the duration of contract strikes in US manufacturing.
The data are taken from J. Kennan, ‘The Duration of Contract Strikes in
US Manufacturing’, Journal of Econometrics, 28 (1985), 5-28. We will
describe (i) the data, (ii) results of the log-normal distribution for the
strike durations, (iii) the results of different hazard models, (iv) a
diagnostic check in terms of the generalized residuals, and (v) the effect of
censoring.

(i) The data

The data set consists of $n = 62$ durations $y_i$ (measured in days) of
finished strikes. A histogram of these durations is given in Exhibit 6.14
(a). As possible explanatory variable, an indicator $x_i$ of general economic
activity during the strike is used. Exhibit 6.16 (a) shows a histogram of
this production indicator and (b) a scatter diagram of log-durations against
this indicator. This indicates that strikes may last longer if economic
conditions are worse.

(ii)  Log-normal  distribution  for  the  durations 

The histogram of the log-durations $\log(y_i)$ in Exhibit 6.14 (b) shows that
the null hypothesis of normality is not rejected, as the Jarque-Bera test for
normality has a P-value of P = 0.23. This motivates the use of the log-normal
density for the strike durations. The sample mean of the log-durations is
$\hat\mu = 3.10$ and the sample standard deviation is $\hat\sigma = 1.29$
(see also Panel 3 in Exhibit 6.16). The expected duration time of strikes is
then estimated as
$E[y_i] = E[e^{\log(y_i)}] = e^{\hat\mu + \hat\sigma^2/2} = 52$ days (using
the result in Exercise 5.2 (e) for


(a) 


(b) 


PROD 


Panel  3:  Least  Squares;  Dependent  Variable:  LOG(STRIKEDUR) 

Variable 

Coefficient 

Std.  Error 

t-Statistic 

Prob. 

C 

3.104456 

0.164406 

18.88284 

0.0000 

S.E.  of  regression 

1.294536 

Panel  4:  Least  Squares;  Dependent  Variable:  LOG(STRIKEDUR) 

Variable 

Coefficient 

Std.  Error 

t-Statistic 

Prob. 

C 

3.205657 

0.160988 

19.91236 

0.0000 

PROD 

-9.180774 

3.404293 

-2.696822 

0.0091 

S.E.  of  regression 

1.232705 

Exhibit  6.16  Duration  of  Strikes  (Example  6.9) 


Histogram  of  a  production  index  (a),  scatter  diagram  of  strike  durations  (in  logarithms) 
against  the  production  index  ( b ),  regression  of  strike  durations  (in  logarithms)  on  a  constant 
(Panel  3)  and  on  a  constant  and  the  production  index  (Panel  4). 
Panel 5: ML: EXPONENTIAL without explanatory variable

Parameter              Coefficient   Std. Error   z-Statistic   Prob.
GAMMA                  0.023432      0.002793     8.389152      0.0000
Log likelihood         -294.7275
Akaike info criterion  9.539598

Panel 6: ML: EXPONENTIAL with production index as explanatory var.

Parameter              Coefficient   Std. Error   z-Statistic   Prob.
GAMMA                  0.022902      0.003185     7.189665      0.0000
BETA (coef PROD)       9.333815      2.977868     3.134395      0.0017
Log likelihood         -289.7647
Akaike info criterion  9.411765

Panel 7: ML: WEIBULL without explanatory variable

Parameter              Coefficient   Std. Error   z-Statistic   Prob.
ALPHA                  0.924688      0.111835     8.268300      0.0000
GAMMA                  0.032183      0.015191     2.118483      0.0341
Log likelihood         -294.4027
Akaike info criterion  9.561377

Panel 8: ML: WEIBULL with production index as explanatory var.

Parameter              Coefficient   Std. Error   z-Statistic   Prob.
ALPHA                  1.007855      0.122542     8.224586      0.0000
GAMMA                  0.022160      0.011101     1.996139      0.0459
BETA (coef PROD)       9.405522      3.071885     3.061808      0.0022
Log likelihood         -289.7617
Akaike info criterion  9.443926

[Exhibit 6.16, panels (i), (j), and (k): plots; vertical axis from 0.0 to 1.0]

Exhibit 6.16 (Contd.)


Estimated  hazard  rate  models  without  explanatory  variable  (exponential  model  in  Panel  5 
and  Weibull  model  in  Panel  7)  and  proportional  hazard  models  with  the  production  index 
as  explanatory  variable  (exponential  model  in  Panel  6  and  Weibull  model  in  Panel  8). 
(i)  shows  the  empirical  survival  function  of  the  durations  and  the  survival  function  of  the 
estimated  model  of  Panel  5,  (j)  shows  the  survival  function  of  the  estimated  proportional 
model  of  Panel  6  (for  given  value  PROD  =  0  of  the  index),  and  (k)  shows  the  scatter 
diagram  of  the  quantiles  of  the  generalized  residuals  of  the  model  of  Panel  6  (on  the 
horizontal  axis)  against  the  theoretical  quantiles  (of  the  unit  exponential  distribution, 
on  the  vertical  axis). 
Panel 12: TOBIT (censored normal); Dep. Var.: LOG(STRIKECENS80)
Right censoring at value 80

Variable               Coefficient   Std. Error   z-Statistic   Prob.
C                      3.026027      0.150206     20.14582      0.0000
Error Distribution
SCALE: SIGMA           1.182725      0.106212     11.13553      0.0000
S.E. of regression     1.202275
Akaike info criterion  3.238035

Panel 13: TOBIT (censored normal); Dep. Var.: LOG(STRIKECENS80)
Right censoring at value 80

Variable               Coefficient   Std. Error   z-Statistic   Prob.
C                      3.113678      0.146893     21.19691      0.0000
PROD                   -7.951562     3.106232     -2.559874     0.0105
Error Distribution
SCALE: SIGMA           1.124777      0.101008     11.13556      0.0000
S.E. of regression     1.153019
Akaike info criterion  3.169821

Panel 14: ML: EXPONENTIAL hazard (right censoring at value 80)

Parameter              Coefficient   Std. Error   z-Statistic   Prob.
GAMMA                  0.023596      0.003000     7.865414      0.0000
Log likelihood         -237.3338
Akaike info criterion  7.688188

Panel 15: ML: EXPONENTIAL PROP. hazard (right censoring at value 80)

Parameter              Coefficient   Std. Error   z-Statistic   Prob.
GAMMA                  0.021726      0.003073     7.070767      0.0000
BETA (coef PROD)       10.18983      2.979147     3.420386      0.0006
Log likelihood         -232.5119
Akaike info criterion  7.564900

[Exhibit 6.16, panels (p) and (q): plots]


Panel 16: ML: WEIBULL hazard (right censoring at value 80)

Parameter              Coefficient   Std. Error   z-Statistic   Prob.
ALPHA                  0.890842      0.123693     7.202063      0.0000
GAMMA                  0.035919      0.018397     1.952408      0.0509
Log likelihood         -236.8438
Akaike info criterion  7.704639

Panel 17: ML: WEIBULL PROP. hazard (right censoring at value 80)

Parameter              Coefficient   Std. Error   z-Statistic   Prob.
ALPHA                  0.950645      0.127085     7.480377      0.0000
GAMMA                  0.026287      0.013733     1.914126      0.0556
BETA (coef PROD)       9.915412      3.134831     3.162981      0.0016
Log likelihood         -232.4195
Akaike info criterion  7.594179

Exhibit  6.16  (Contd.) 

Hazard  models  (without  and  with  explanatory  variable)  for  strike  duration  data  censored 
at  a  maximum  of  eighty  days,  lognormal  hazard  models  (Panels  12  and  13,  corresponding 
to  tobit),  exponential  hazard  models  (Panels  14  and  15),  and  Weibull  hazard  models 
(Panels  16  and  17). 
the mean of a log-normal distribution). The result of regressing $\log(y_i)$ on a
constant and the production indicator is given in Panel 4 of Exhibit 6.16. The
production indicator is significant (P = 0.009) and has the expected negative
sign ($-9.18$). That is, the average duration of strikes is shorter in periods of
high production than in periods of low production, possibly because strikes
are relatively more costly in periods of high economic activity.
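
The estimate of about 52 days rests on the mean of a log-normal distribution, $E[y] = e^{\mu + \sigma^2/2}$. A minimal Python sketch of this arithmetic, taking the Panel 3 estimates as inputs (the function name `lognormal_mean` is our own and not part of the text), could look as follows:

```python
import math

def lognormal_mean(mu: float, sigma: float) -> float:
    """Mean of y when log(y) is N(mu, sigma^2)."""
    return math.exp(mu + 0.5 * sigma ** 2)

# Panel 3 of Exhibit 6.16: constant 3.104456, S.E. of regression 1.294536
mu_hat, sigma_hat = 3.104456, 1.294536

mean_duration = lognormal_mean(mu_hat, sigma_hat)
median_duration = math.exp(mu_hat)  # the median of a log-normal is exp(mu)

print(f"estimated mean duration:   {mean_duration:.1f} days")
print(f"estimated median duration: {median_duration:.1f} days")
```

The mean lies well above the median because the log-normal density is skewed to the right, so a few long strikes pull the average upwards.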

(iii)  Results  of  different  hazard  models 

Next we estimate exponential and Weibull hazard models by maximizing the
corresponding log-likelihood (6.45). As the data are not censored, the term in
(6.45) with the summation over $\{z_i = 1\}$ runs over all $n$ observations. It is clear
from the outcomes in Panels 5-8 of Exhibit 6.16 that the hypothesis of a
constant hazard rate ($\alpha = 1$ in the Weibull models) is not rejected: in both
Weibull models the estimate of $\alpha$ lies within one standard error of 1. The exhibit
shows the survival functions $S(t) = e^{-\gamma t}$ of the exponential hazard model (in
(i), corresponding to Panel 5, that is, without $x_i$) and of the proportional
exponential hazard model (in (j), corresponding to Panel 6, that is, with
explanatory variable $x_i$; the plot is for $x_i = 0$). Exhibit 6.16 (i) and (j)
also show the empirical survival function, that is, $S_{emp}(t) =
(\text{number of } y_i > t)/n$. The survival functions of the estimated models are
quite close to the empirical survival function.
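
The z-statistics reported for ALPHA in Panels 7 and 8 test $\alpha = 0$, not $\alpha = 1$, so the test of a constant hazard rate has to be computed separately. The following Python sketch does this for the model without explanatory variable, once as a Wald test and once as a likelihood ratio test of the exponential against the Weibull model. It uses only numbers from Exhibit 6.16; the helper functions are our own, and the p-values rely on the usual large-sample normal and chi-square(1) approximations.

```python
import math

def normal_two_sided_p(z: float) -> float:
    """Two-sided p-value of a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def chi2_1_p(stat: float) -> float:
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))

# Panel 7 (Weibull, no regressor): estimate of ALPHA and its standard error
alpha_hat, se_alpha = 0.924688, 0.111835
z_wald = (alpha_hat - 1.0) / se_alpha

# Log-likelihoods: Panel 5 (exponential, restricted) and Panel 7 (Weibull)
loglik_exp, loglik_weibull = -294.7275, -294.4027
lr = 2.0 * (loglik_weibull - loglik_exp)

print(f"Wald z = {z_wald:.2f}, p-value = {normal_two_sided_p(z_wald):.2f}")
print(f"LR     = {lr:.2f}, p-value = {chi2_1_p(lr):.2f}")
```

Both statistics stay far below the usual 5 per cent critical values (1.96 for the Wald test and 3.84 for the LR test), in line with the conclusion that a constant hazard rate is not rejected.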

To obtain somewhat more insight into the estimated proportional exponential
hazard model we compute the expected duration of strikes for three values
of the economic indicator, that is, when economic activity is minimal
($x_i = -0.10$), neutral ($x_i = 0$), and maximal ($x_i = 0.07$). The expected
durations are given by $E[y_i] = e^{-\beta x_i}\mu_0 = e^{-\beta x_i}/\gamma$. This gives expected durations of
111, 44, and 23 days respectively, so that the differences are quite considerable.
Further, the probability that a strike will end today (that is, the hazard
rate) is $\gamma e^{\beta x_i} = 0.023 e^{9.33 x_i}$, that is, around 1 per cent if economic activity is
minimal, 2.3 per cent if economic activity is neutral, and 4.5 per cent if
economic activity is maximal.
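
As a check on this arithmetic, the expected durations and hazard rates can be recomputed from the Panel 6 estimates. The sketch below is only an illustration; the function names `hazard` and `expected_duration` are ours.

```python
import math

# Panel 6 of Exhibit 6.16 (proportional exponential hazard model)
gamma_hat = 0.022902   # baseline hazard rate (x = 0)
beta_hat = 9.333815    # coefficient of the production index

def hazard(x: float) -> float:
    """Hazard rate gamma * exp(beta * x); it does not depend on elapsed time."""
    return gamma_hat * math.exp(beta_hat * x)

def expected_duration(x: float) -> float:
    """The mean of an exponential duration is the reciprocal of its hazard."""
    return 1.0 / hazard(x)

for label, x in [("minimal", -0.10), ("neutral", 0.0), ("maximal", 0.07)]:
    print(f"{label:8s} x = {x:+.2f}: "
          f"hazard = {100 * hazard(x):.1f}% per day, "
          f"expected duration = {expected_duration(x):.0f} days")
```

Because the hazard is constant over time in this model, the expected duration is simply its reciprocal, which is why the two sets of numbers move in opposite directions.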

(iv)  Diagnostic  check  in  terms  of  generalized  residuals 

Exhibit 6.16 (k) shows the generalized residuals of the proportional exponential
hazard model, that is, with the production index as explanatory variable,
so that $e_i = 0.023\, y_i e^{9.33 x_i}$. In Exhibit 6.16 (k), the quantiles of these residuals
$e_i$, $i = 1, \ldots, 62$, are compared with the (62) quantiles of the unit exponential
distribution. In case of a perfect fit these quantiles should be the same, and the
diagram shows that the deviations are not large as the plot of the two quantiles
lies close to the 45° line. This provides support for the proportional exponential
hazard model. The first three sample moments of the generalized residuals
are respectively 1, 1.91, and 4.88. This is quite close to the corresponding
population moments of the unit exponential distribution, which are respectively
1, 2, and 6.
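
The strike data themselves are not reproduced here, so the diagnostic can only be sketched on simulated data. The Python example below draws durations from a proportional exponential hazard model, computes the generalized residuals, and compares their moments and quantiles with those of the unit exponential distribution. The sample size, the uniform distribution of the regressor, and the use of the true parameter values in place of estimates are all assumptions made purely for illustration.

```python
import math
import random

random.seed(0)
gamma, beta = 0.023, 9.33   # rounded Panel 6 values, used here as "true" parameters
n = 5000                    # hypothetical sample size (the strike data have n = 62)

# Hypothetical regressor values on the range of the index mentioned in the text
x = [random.uniform(-0.10, 0.07) for _ in range(n)]
# Durations with hazard rate gamma * exp(beta * x_i)
y = [random.expovariate(gamma * math.exp(beta * xi)) for xi in x]

# Generalized residuals e_i = gamma * y_i * exp(beta * x_i), sorted for the quantile plot
e = sorted(gamma * yi * math.exp(beta * xi) for xi, yi in zip(x, y))

# First three sample moments; the unit exponential has moments 1, 2, and 6
moments = [sum(ei ** k for ei in e) / n for k in (1, 2, 3)]
print("sample moments:", [round(m, 2) for m in moments])

# Theoretical quantiles of the unit exponential at probabilities (i - 0.5) / n
theoretical = [-math.log(1.0 - (i - 0.5) / n) for i in range(1, n + 1)]
# Largest deviation from the 45 degree line over the lowest 90% of the quantiles
max_gap = max(abs(a - b) for a, b in zip(e[: int(0.9 * n)], theoretical))
print("largest quantile gap below the 90% point:", round(max_gap, 3))
```

With only 62 observations, as in the example, the sample moments are much noisier than in this simulation, especially the third one, so deviations such as 4.88 against 6 are not surprising.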
(v) The effect of censoring

As discussed before, in many applications part of the durations are censored.
To illustrate the effect of censoring, we now suppose that all strikes with a
duration of eighty days or more are censored at the value of 80. In the data
set there are twelve such durations. For the log-normal model we estimate
tobit models (without and with explanatory variable) with known censoring
from above at eighty days. The outcomes do not differ much from those
obtained without censoring. We also estimate the exponential and Weibull
hazard models (both without and with the economic indicator) on the
censored data set. In this case the first term in the log-likelihood (6.45)
runs over the fifty uncensored durations (with $y_i < 80$) and the second
term runs over all sixty-two durations (where the largest twelve durations
all have the value $y_i = 80$). The results in Panels 12-17 of Exhibit 6.16 show
that censoring does not lead to substantial changes in the estimated hazards. As
an illustration we compare the proportional exponential hazard models
$\lambda_i(t) = \gamma e^{\beta x_i}$. For the original (uncensored) data, the estimates in Panel 6 are
$\hat{\gamma} = 0.023$ (0.003) and $\hat{\beta} = 9.334$ (2.978), with standard errors in
parentheses. For the censored data, the estimates in Panel 15 are $\hat{\gamma} = 0.022$
(0.003) and $\hat{\beta} = 10.190$ (2.979). Although the censored data obviously
contain less information than the original data, this has hardly any
effect on the estimates and their standard errors in this example. The
differences are so small because the exponential hazard model provides a
good description both for shorter ($y_i < 80$) and for longer ($y_i \ge 80$) strike
durations.
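
For the exponential model without explanatory variable, the censored log-likelihood has a simple form: each completed spell contributes $\log(\gamma)$ and every spell, censored or not, contributes $-\gamma y_i$. The maximizer is then the number of completed spells divided by the total observed time. The Python sketch below illustrates this on simulated data with censoring at 80; the sample size and the true hazard rate are hypothetical choices, and the function names are ours.

```python
import math
import random

def exp_censored_loglik(gamma, durations, censored):
    """log(gamma) for each completed spell, minus gamma * y_i summed over all spells."""
    n_complete = sum(1 for c in censored if not c)
    return n_complete * math.log(gamma) - gamma * sum(durations)

def exp_censored_mle(durations, censored):
    """Closed-form maximizer: completed spells divided by total observed time."""
    n_complete = sum(1 for c in censored if not c)
    return n_complete / sum(durations)

random.seed(1)
true_gamma, limit, n = 0.023, 80.0, 2000     # hypothetical simulation settings
latent = [random.expovariate(true_gamma) for _ in range(n)]
censored = [t >= limit for t in latent]      # True if the spell is cut off at 80
observed = [min(t, limit) for t in latent]

gamma_full = n / sum(latent)                        # ML on the uncensored durations
gamma_cens = exp_censored_mle(observed, censored)   # ML that accounts for censoring
gamma_naive = n / sum(observed)                     # treats censored spells as completed

print(f"uncensored ML: {gamma_full:.4f}")
print(f"censored ML:   {gamma_cens:.4f}")
print(f"naive:         {gamma_naive:.4f}")
```

The naive estimate, which ignores the censoring, overstates the hazard rate because it counts the cut-off spells as if they had ended at eighty days.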

Some  further  aspects  of  these  data  are  left  as  an  exercise  (see  Exercise  6.17). 

Exercises:  T:  6.6;  E:  6.17. 


6.3.5  Summary 

We considered different situations where the dependent variable $y_i$ in the
regression equation $y_i = x_i'\beta + \varepsilon_i$ is limited in some sense. The model
should be based on the relevant type of limited dependent variable. OLS
is not consistent for this type of data. The models can be estimated
consistently by maximum likelihood. In some cases also a two-step
method is possible, where in the first step the bias term of OLS is estimated
and in the second step $y_i$ is regressed on $x_i$ and the estimated bias term. We
paid attention to the following types of data.

•  In  truncated  samples  the  observed  data  come  from  a  selective  part  of  the 
population.  This  can  be  modelled  in  terms  of  truncated  distributions  for 
the  error  term.  The  bias  term  of  OLS  can  be  expressed  in  terms  of  the 
inverse  Mills  ratio. 



•  Censored data arise if the dependent variable cannot take values below
(or above) a certain threshold value. This can be modelled by means of
the tobit (type 1) model in terms of distributions that are mixed
continuous-discrete. The continuous part applies to the non-censored
outcomes, and the discrete part to the censored outcomes. The tobit
model can be estimated by ML or by a Heckman two-step method.

•  Sometimes the selection process that determines which data are
observed can be modelled by means of a probit model with additional
explanatory variables. This is called the tobit type 2 model, which can
be estimated by ML or by a two-step Heckman method. This model is
also useful to estimate treatment effects in case the assignment of
treatments is not blind but correlated with the dependent variable $y_i$.

•  A duration variable measures the time that elapses before a certain event
takes place. Such a variable can take on only non-negative values. The
observed values are censored if the relevant event has not yet taken place
at the time of observation. Durations are modelled in terms of hazard
rates that may depend on relevant explanatory variables.

•  Some care is needed to interpret the estimated parameters in models for
truncated or censored data. It is more informative to determine
(average) marginal effects. The marginal effects differ from (and are actually
smaller than) the estimated parameters. This is caused by the fact that
truncation and censoring lead to non-linearities in the observed
relations between $y_i$ and $x_i$. The difference $E[y_i] - x_i'\beta$ is called the bias
correction term, and we derived explicit expressions for this term.
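
To make the role of the inverse Mills ratio concrete, the following Python sketch simulates a sample truncated from below at zero and compares the mean of the retained observations with $\mu + \sigma\lambda(\mu/\sigma)$, where $\lambda(c) = \phi(c)/\Phi(c)$. The parameter values and the number of drawings are hypothetical, and the helper functions are our own.

```python
import math
import random

def norm_pdf(c: float) -> float:
    """Standard normal density."""
    return math.exp(-0.5 * c * c) / math.sqrt(2.0 * math.pi)

def norm_cdf(c: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(c / math.sqrt(2.0)))

def inverse_mills(c: float) -> float:
    """Inverse Mills ratio phi(c) / Phi(c)."""
    return norm_pdf(c) / norm_cdf(c)

random.seed(2)
mu, sigma = 0.5, 2.0   # hypothetical mean and standard deviation of the latent variable
draws = [mu + sigma * random.gauss(0.0, 1.0) for _ in range(200_000)]

# Truncation: only observations with y > 0 are retained
kept = [y for y in draws if y > 0]
sample_mean = sum(kept) / len(kept)

# Theoretical mean of the truncated variable: mu plus the bias term sigma * lambda
theory = mu + sigma * inverse_mills(mu / sigma)

print(f"share retained:          {len(kept) / len(draws):.3f}")
print(f"mean of truncated data:  {sample_mean:.3f}")
print(f"mu + sigma * lambda:     {theory:.3f}")
```

The gap between the truncated sample mean and $\mu$ is the bias term that a regression on the selected observations would wrongly attribute to the regression function.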
SUMMARY

This  chapter  discussed  econometric  models  for  dependent  variables  that  are 
restricted  in  their  domain  of  possible  outcomes.  Data  of  this  type  become 
more  and  more  widespread  in  empirical  economic  work  on  choices  of 
individual  economic  agents.  In  the  year  2000,  the  Nobel  prize  in  economics 
was  awarded  to  McFadden  and  Heckman,  two  pioneers  in  the  econometric 
modelling  of  this  type  of  economic  data.  For  binary  data  we  discussed  logit 
and  probit  models,  with  extensions  for  (unordered  or  ordered)  multinomial 
data.  Truncated  and  censored  data  can  be  described  by  truncated  and  mixed 
continuous-discrete  probability  distributions,  in  particular  tobit  models.  We 
also  described  models  for  duration  data  and  methods  to  estimate  the  hazard 
rate.  All  models  discussed  in  this  chapter  can  be  estimated  by  maximum 
likelihood, and in some cases regression methods can be used by incorporating
bias correction terms. As the models are non-linear, the marginal effects
of  the  explanatory  variables  on  the  dependent  variable  are  in  general  not 
constant  over  the  population.  We  derived  expressions  for  these  marginal 
effects.  Further  we  paid  attention  to  the  intuitive  interpretation  of  the  models 
and  to  methods  for  testing  the  empirical  adequacy  in  practical  applications. 
the Handbook of Econometrics. Further there are many textbooks that deal
THEORY QUESTIONS

6.1* (Section 6.1.4)

Consider the binary logit model with heteroskedastic
error terms where $\sigma_i = e^{-z_i'\gamma}$ (here we use this
expression instead of $\sigma_i = e^{z_i'\gamma}$ of the text as this
simplifies some of the derivations; both models are
of course equivalent by reversing the signs of all
entries of the parameter vector $\gamma$). By $\theta = (\beta', \gamma')'$ we
denote the vector of unknown parameters of the
model. We derive the LM-test for the null hypothesis
of homoskedasticity ($\gamma = 0$).

a. Compute the score vector $s(\theta)$ consisting of the
subvectors $\partial \log(L)/\partial\beta$ and $\partial \log(L)/\partial\gamma$.
Evaluate $s(\theta)$ at the ML estimators under the null
hypothesis that $\gamma = 0$.

b. Compute the Hessian matrix $H(\theta)$ consisting of
the second order derivatives of $\log(L)$ with
respect to the parameters $(\beta, \gamma)$. Next compute the
information matrix $I_n(\theta) = -E[H(\theta)]$, evaluated
at the ML estimators under the null hypothesis
that $\gamma = 0$.

c. Show that the LM-test $LM = s'I_n^{-1}s$ amounts
in large enough samples to $LM = nR^2$ of the
auxiliary regression (6.14), in the sense that
$\mathrm{plim}(s'I_n^{-1}s - nR^2) = 0$.

6.2 (Sections 6.1.2, 6.1.3)

In direct mailings the fraction of respondents is
often relatively small. To limit the database, one
sometimes keeps only a fraction $f_1$ of the respondents
(with $y_i = 1$) and a fraction $f_1 f_2$ of the
non-respondents (with $y_i = 0$) in the sample, where
$0 < f_1 < 1$ and $0 < f_2 < 1$ are known. Let $z_i$ be the
selection variable with $z_i = 1$ for selected observations
and $z_i = 0$ for deleted ones. The selection is
supposed to be random.

a. Show that the sample probabilities $p_i^s = P[y_i = 1 \mid z_i = 1]$
are related to the population probabilities
$p_i = P[y_i = 1]$ by $p_i^s = p_i/(p_i + f_2(1 - p_i))$.

b. Suppose that the population probabilities
$p_i$ satisfy the logit model (with parameters
$\beta_1, \beta_2, \ldots, \beta_k$, where $\beta_1$ is the constant term).
Show that the sample probabilities $p_i^s$ then
also satisfy the logit model, with parameters
$\beta_1 - \log(f_2), \beta_2, \ldots, \beta_k$.

c. Show that the individual rankings in the logit
model are not affected by selecting a subsample,
that is, if $p_i > p_j$ then also $p_i^s > p_j^s$.

d. Show that ML in selected subsamples remains
consistent, but that it is not efficient.

e. Suppose that there are relatively few respondents
(so that $p_i$ is close to zero). In the sample of
size $n$, let there be $m$ respondents who are all
kept in the subsample, whereas from the $n - m$
non-respondents only $m$ are chosen randomly.
So the sample size is reduced by a factor $2m/n$.
Argue why the standard errors of the logit
estimators will in general increase by much less
than the factor $\sqrt{n/2m}$, by using the expression
(6.11) for the information matrix.

6.3 (Section 6.2.2)

a*. Prove the expressions for the probabilities $p_{ij}$ in
(6.20) for the multinomial and the conditional
logit model when the error terms follow the
extreme value distribution with cumulative
distribution $e^{-e^{-t}}$.

b. Prove the expressions given in Section 6.2.2 for
the marginal effects $\partial P_{MNL}[y_i = j]/\partial x_i$ in the
multinomial logit model and $\partial P_{CL}[y_i = j]/\partial x_{ij}$
and $\partial P_{CL}[y_i = j]/\partial x_{ih}$ in the conditional logit
model.

c. Prove that the log-likelihood for the multinomial
logit model is given by (6.21). Prove also the
expressions for the gradient and Hessian matrix
given below (6.21).

d. Prove that the log-likelihood for the conditional
logit model is given by (6.22). Prove also the
expressions for the gradient and Hessian matrix
given below (6.22).
6.4 (Sections 6.3.1-6.3.3)

a. Illustrate the bias term $\sigma\lambda_i$ in (6.28) by simulating
a truncated sample from the model
$y_i = 1 + \varepsilon_i$, where $x_i = 1$ is constant and the $\varepsilon_i$
are random drawings from N(0, 1). The sample
is truncated by considering only the observations
with $y_i > 0$ and deleting all observations
with $y_i \le 0$. Compare the sample mean of the
remaining observations with the theoretical
value of the inverse Mills ratio $\lambda_i = \lambda$ in this
case.

b*. Prove the expression (6.30). Prove that the
correction factor in front of $\beta$ lies between zero and
one.

c. Compute the two terms in (6.35) for the case of
the standard normal distribution. Prove that,
when added, this gives $\Phi_i\beta$.

d. Illustrate, by means of a suitable simulation
experiment, that OLS is an inconsistent estimator
of treatment effects if the treatments are not
applied randomly. Use this simulation also to
check the result mentioned in Section 6.3.3 that
$E[y_i \mid z_i] = x_i'\beta + \alpha z_i + \rho\sigma\lambda_i$ (which leads to
(6.41)), by computing the sample means of $y_i$
over the two subsamples (with $z_i = 1$ and with
$z_i = 0$) and by averaging the terms $x_i'\beta + \alpha z_i +
\rho\sigma\lambda_i$ over these two subsamples.

6.5 (Section 6.3.2)

a. Suppose that the latent variable $y_i^*$ satisfies
$y_i^* = x_i'\beta + \sigma\varepsilon_i$ with all the standard assumptions
(in particular, with $\varepsilon_i \sim \mathrm{NID}(0, 1)$), and
that we observe $y_i = y_i^*$ if $c_0 < y_i^* < c_1$ but that
$y_i = c_0$ for $y_i^* \le c_0$ and $y_i = c_1$ for $y_i^* \ge c_1$, with
$c_0$ and $c_1$ given constants. Derive the expression
for the log-likelihood of this model.

b. Propose a consistent method to estimate the
model described in a when the threshold values
$c_0$ and $c_1$ are unknown.

c*. In some applications (for instance, in income
and budget studies of households) the data are
censored in the sense that a group of large values
is summarized by their sample mean. Suppose
that the data $y_i$ satisfy the following model,
where $c_0$ is a known threshold value and the
error terms $\varepsilon_i$ are NID(0, 1). For $y_i^* = x_i'\beta +
\sigma\varepsilon_i \le c_0$ we observe $y_i = y_i^*$, but for $y_i^* =
x_i'\beta + \sigma\varepsilon_i > c_0$ we observe only the sample
mean of the values $y_i^*$ and the sample mean of
the corresponding values of the explanatory
variables $x_i$. Also the number of these large
observations is given. Derive the expression for
the log-likelihood of this model. What condition
is needed on $c_0$ to be able to estimate the
parameters $\beta$ and $\sigma$ by maximum likelihood?

d. Suppose that the threshold value $c_0$ in the model
of c is unknown. Propose a consistent method to
estimate the parameters $(\beta, \sigma, c_0)$ of the resulting
model.

6.6 (Section 6.3.4)

a. Show the expressions in (6.42) concerning the
relations between the hazard rate, the survival
function, and the density function of duration
data.

b. Show the expression (6.43) for the hazard rate
corresponding to the Weibull density.

c. Show the expression (6.44) for the hazard rate
corresponding to the log-normal density. Derive
the equation given below (6.44) in Section 6.3.4
for the time instant where this hazard rate
reaches its maximum.

d. Prove that the expected duration in the proportional
Weibull hazard rate model is equal to
$E[y_i] = e^{-x_i'\beta/\alpha}\mu_0$, where $\mu_0$ is the expected
duration in the baseline hazard model with $x_i = 0$.


EMPIRICAL  AND  SIMULATION  QUESTIONS 

6.7 (Sections 6.1.2, 6.1.3)

a. Simulate a sample of n = 200 data where $x_i = i$
and $y_i^* = -10 + 0.1x_i + \varepsilon_i$ with $\varepsilon_i$ independent
drawings from N(0, 1), $i = 1, \ldots, 200$. Generate
observed choices $y_i$ by $y_i = 0$ if $y_i^* \le 0$ and
$y_i = 1$ if $y_i^* > 0$.

b. Compute the theoretical odds ratio
$P[y = 1]/P[y = 0]$ for the following five values
of x: 60, 80, 100, 120, and 140. Compare this
with the odds ratios in the sample for the
observations with x respectively in the intervals
55 < x < 65, 75 < x < 85, and so on, to
135 < x < 145. Clarify these results with a scatter
diagram of $y_i^*$ against x.

c. Perform a regression of $y_i$ on the constant and the
variable $x_i$ with result $\hat{y}_i = a + bx_i$. Compute the
estimated odds ratios $\hat{y}_i/(1 - \hat{y}_i)$ for the five values
of x in b, and compare the outcomes with those
in b.

d. Estimate a probit model for the data $(x_i, y_i)$,
$i = 1, \ldots, 200$. Compute again the estimated
odds ratios for the five values of x in b and
compare the outcomes with those in b and c.

e. Estimate a logit model for the data $(y_i, x_i)$,
$i = 1, \ldots, 200$. Compare the estimated parameters
of this model with those obtained for the
probit model in d.

f. Using the logit model in e, compute the estimated
odds ratios for the five values of x in b. Compare
the outcomes with those of the probit model in d.

6.8 (Sections 6.1.3, 6.1.4)

Consider the following data generating process. The
variable $x_i$ consists of independent drawings from
N(100, 100), and $y_i^* = -10 + 0.1x_i + \varepsilon_i$ with $\varepsilon_i$
independent drawings from N(0, 1) that are independent
of $x_i$. The choices $y_i$ are generated by $y_i = 0$ if
$y_i^* \le 0$ and $y_i = 1$ if $y_i^* > 0$. The observed data
consist of the values of $(x_i, y_i)$ for $i = 1, \ldots, n$.

a. Generate samples with n = 100, n = 1000, and
n = 10,000 from this process. For the resulting
three data sets, estimate the parameters of the
probit model. Compare the estimates with
the theoretical values of the data generating
process.

b. Estimate logit models for the three data sets.
Compare the estimated logit parameters with
the probit estimates. Explain why the logit
estimators are not consistent.

c. Perform heteroskedasticity tests on the standardized
residuals of the six models estimated in a
and b. Use the following specification of the
standard deviation: $\sigma_i = e^{\gamma x_i}$, with null hypothesis
$H_0: \gamma = 0$.

d. Now generate $\varepsilon_i$ by independent drawings from
the distribution $N(0, \sigma_i^2)$, where $\sigma_i = e^{x_i/100}$, and
generate corresponding new values of $y_i^*$ and $y_i$.
Estimate logit and probit models for a sample of
n = 10,000 observations from this process.

e. Comment on the outcomes in d. Why are the
probit estimators no longer consistent?

f. How can the parameters be estimated consistently
in this case, if it is given that $\sigma_i = e^{x_i/100}$?
Estimate the parameters by adjusting for this
heteroskedasticity and compare the resulting
estimates with the parameter values of the data
generating process.

6.9 (Sections 6.3.1, 6.3.2)

Consider the following data generating process. The
variable $x_i$ consists of independent drawings from
N(100, 100), and $y_i^* = -10 + 0.1x_i + \varepsilon_i$, with $\varepsilon_i$
independent drawings from N(0, 1) that are independent
of $x_i$. In Exercise 6.8 we considered binary data
related to $y_i^*$, but now we will consider truncated
and censored data, where $y_i = 0$ for $y_i^* \le 0$ and
$y_i = y_i^*$ for $y_i^* > 0$.

a. Suppose that the sample is truncated, so that the
data consist only of the observations $(x_i, y_i)$ with
$y_i > 0$. Generate a sample of n = 100 truncated
observations. Estimate the parameters $\alpha$ (the
constant), $\beta$ (the slope), and $\sigma^2$ (the variance of $\varepsilon_i$) by
regressing $y_i$ on a constant and $x_i$. Estimate the
parameters also by ML by maximizing (6.29).

b. Relate the bias of the OLS estimator of the slope
parameter ($\beta = 0.1$) in a to the result (6.30) on
marginal effects for truncated data.

c. Now suppose that the sample is censored, so that
the data consist of observations $(x_i, y_i)$ including
also the cases where $y_i = 0$. Generate a sample of
n = 100 censored observations. Estimate the
parameters $\alpha$, $\beta$, and $\sigma^2$ by regressing $y_i$ on a constant
and $x_i$. Estimate the parameters also by ML using
the standard normal distribution, that is, by
maximizing (6.36).

d. Relate the bias of the OLS estimator of the
slope parameter ($\beta = 0.1$) in c to the result
(6.35) that implies (see p. 494) that for censored
data $\partial E[y_i]/\partial x_i = \Phi(x_i'\beta/\sigma)\beta$.

e. Compare the results of the ML estimates for the
truncated sample in a and for the censored model
in c. Which method produced the smallest (finite
sample) bias, and which the smallest
(large sample) standard errors? Could this be
expected or not?

6.10 (Section 6.1.4)

We consider a (simulated) data set of 100
employees in the ICT sector who responded
to a questionnaire on teleworking.
For each employee we know the
answer to the question whether she or he wants to
make use of teleworking ($y_i = 1$ if yes, $y_i = 0$ if no),
the gender ($x_{2i} = 1$ for females and $x_{2i} = 0$ for
males), and the travel distance between home and
work ($x_{3i}$, in miles).

a. Estimate the logit model $P[y_i = 1] = \Lambda(\beta_1 +
\beta_2 x_{2i})$. Does gender have a significant effect?
What is the hit rate of this model?

b. Estimate the logit model $P[y_i = 1] = \Lambda(\beta_1 +
\beta_2 x_{2i} + \beta_3 x_{3i})$, that is, including travel distance
as additional explanatory variable. Does
gender have a significant effect? What is the hit
rate of this model?

c. Explain the possible cause of the differences in
the significance of the variable ‘gender’ in a and b.

d. Perform a Likelihood Ratio test on the significance
of the variable ‘travel distance’.

e. Make two plots (in a single diagram) of $P[y = 1]$
as a function of the travel distance, one for
females and a second one for males. Comment
on the outcomes.

6.11 (Section 6.1.3)

Consider the direct marketing data of
Example 6.1.

a. Let $\hat{p}_i = \Lambda(x_i'b)$ where b are the logit
estimates. Make a plot of $\hat{p}_i$ against the age of
active males (that is, with $x_{2i} = 1$ and $x_{3i} = 1$)
and also against the age of inactive males (with
$x_{2i} = 1$ and $x_{3i} = 0$). Do there exist segments
in the sample that show distinct response
behaviour?

b. Answer the same questions as in a for the probit
model.

c. Plot the probit log-odds for active and inactive
males against their age, and compare this with
the corresponding curves estimated from the logit
model.

d. The data set in Example 6.1 contains 470
observations with $y_i = 1$ and 455 with $y_i = 0$. This
data set is drawn randomly from a much larger
set that contains 4988 observations with $y_i = 1$
and 100,321 observations with $y_i = 0$. What
estimated values of $\beta$ would you expect if a logit
model were to be estimated for the set of all
105,309 observations (use the result in Exercise
6.2 b)?


6.12 (Section 6.1.5)

Consider again the direct marketing data
of Example 6.1.

a. Divide the 925 individuals into ten age groups,
with the youngest group having ages of 30 years
or less, the oldest with ages 71 or more, and the
other eight groups with ages in the five-year
intervals ranging from 31-35 to 66-70. Determine
the group sizes and the group means of the
explanatory variables (gender, activity, age) and the
explained variable (the fraction of respondents in
each group).

b. Estimate a logit model based on the G = 10
observations of the grouped data in a. The
explanatory variables in this model are a constant,
the three group mean variables for gender, activity,
and age, and finally the square of the mean
age per group.

c. Perform an LR-test on the five restrictions of the
logit model, that is, $p_j = \Lambda(x_j'\beta)$, $j = 1, \ldots, 10$,
where $x_j$ contains the values of the five explanatory
variables for group j and $\beta$ contains the five
model parameters.

d. Estimate the parameters of the logit model also
by applying FWLS to the grouped data.

e. Compare the outcomes in b and d with those
of the logit model for the individual data in
Example 6.2.


6.13 (Sections 6.1.3, 6.1.4, 6.2.2, 6.2.3)

Consider the salary data (of male employees) of Examples 6.4 and 6.5. Instead of
considering three alternatives, let the job categories 1 (administration) and 2 (custodial)
be joined in one alternative, and define the binary variable $y_i$ by $y_i = 0$ if the $i$th
individual has an administrative or custodial job and $y_i = 1$ for a management job.

a.  Estimate  a  logit  model  for  the  binary  variable  y, 
including  as  explanatory  variables  a  constant  and 
the  variables  education,  minority  and  previous 
experience. 

b. Distinguish between minority and non-minority males, and compute the marginal effects
(averaged over the relevant subsamples) of education on the probability to get a job in
management.

c. Estimate the logit model of a without the variable ‘previous experience’. Test for the
significance of the variable ‘previous experience’, both by the t-test and by the LR-test.
Also test for the possible presence of heteroskedasticity with the model
$\sigma_i = e^{\gamma z_i}$, where $z_i$ is the variable ‘education’.

d. Compute the $R^2$ and the hit rate (with corresponding z-value) for the estimated logit
model of c (that is, without the variable ‘previous experience’).

e. Compare this binary logit model with the multinomial logit model in Example 6.4. Which
model do you prefer to predict the probability that a male employee will have a management
position?

f. The probability of a minority male having a management job depends on his education.
Compute this probability for two levels of education, 12 years and 16 years, for the
following three models: the logit model of c (without the variable ‘previous experience’)
and the multinomial and ordered logit models in Examples 6.4 and 6.5. Comment on the
outcomes.

6.14* (Section 6.2.2)

In Example 6.4 we considered a multinomial logit model for the attained job category of
male employees. There are three job categories, $y_i = 1$ for administrative jobs,
$y_i = 2$ for custodial jobs, and $y_i = 3$ for management jobs. Female employees were
excluded from the analysis because there are no female employees with a custodial job.
We now consider two possibilities to investigate the attained job categories for males
and females jointly. The first one is to formulate a multinomial model for males and
females, with gender as additional explanatory variable; the second one is to combine the
multinomial model for males with a binomial model for females (excluding the custodial
jobs for females).

a. Formulate the multinomial logit model (for the data set of all 474 employees, males and
females) for the attained job category, in terms of the explanatory variables gender (with
value 0 for males and 1 for females), education, and minority. Take administration (the
first job category) as reference category.

b. The probability $p_{i2}$ for a custodial job can be made arbitrarily small for females by
giving the corresponding gender coefficient in $\beta_2$ very large negative values (where
$\beta_2$ is the $4 \times 1$ vector of parameters for custodial jobs). Explain that the ML
estimate of this coefficient in the multinomial model is equal to $-\infty$. Describe a
practical method for estimating the remaining seven parameters of this model.


c. As an alternative, write down the log-likelihood of the combined multinomial logit model
(for males) and binary logit model (for females, with job categories 1 and 3 alone). Take
the first job category as reference category, and assume that the parameter values in
management jobs for education and minority are the same for males and females (so that the
model contains in total seven parameters, two for ‘education’, two for ‘minority’, two
constants for males, and one constant for females).

d.  Estimate  the  parameters  of  the  models  in  b  and  c 
and  compare  the  outcomes. 

e.  Perform  diagnostic  tests  on  the  two  models  of  d 
and  compare  this  with  the  results  for  males  alone 
in  Example  6.4.  In  particular,  compare  the  signs 
and  significance  of  coefficients  and  the  hit  rates  of 
the  three  models. 

6.15 (Sections 6.2.2, 6.2.3, 6.3.3)

In this exercise we consider some further aspects of the data set on students of the
Vanderbilt University discussed in Example 6.8. We consider the data of 609 students
following an intermediate course in economics.

a. In Example 6.8 the attained level of mathematics was taken as a binary variable
(‘mathhigh’). The data in the file are more refined because seven ordered levels of
calculus courses are distinguished (see the variable ‘levelmath’ in the data file).
Estimate an ordered logit model for the attained level of mathematics, with the variable
‘levelmath’ as dependent variable and with the same explanatory variables $w_i$ as in
Example 6.8 (see Exhibit 6.12 b for the list of variables).

b.  Compare  the  estimates  of  the  ordered  logit 
model  in  a  with  the  binary  probit  model  in 
Example  6.8  (see  the  results  in  Exhibit  6.13, 
Panel  1). 

c. In Example 6.8 three expected majors were distinguished, namely in natural science, in
the areas of economics, social science, and humanity, or in another field or an unknown
major (the reference category). The data are more refined, as the file contains five majors
by distinguishing between majors in economics, social science, and humanity. Estimate a
multinomial logit model for the expected major in terms of the following explanatory
variables: the SAT scores in mathematics (SATM) and verbal (SATV), the Freshman grade
point average (FGPA), and gender (FEM).

6.16 (Section 6.3.2)

Consider the direct marketing data. In Example 6.7 a tobit model is estimated for the
amount of money invested, whereas in Example 6.2 a binary probit model is estimated for
the decision to invest.

a. Compare the estimated parameter vector $(1/\sigma)\beta$ of the censored regression model
with the estimated parameter vector in the binary probit model (that we denote by
$\gamma$). Do you find the restrictions $\gamma = (1/\sigma)\beta$ of the tobit model
acceptable for these data?

b.  Discuss  possible  methods  to  obtain  a  better 
model  for  the  joint  decision  to  invest  and  how 
much  to  invest. 

6.17 (Section 6.3.4)

In this exercise we consider some theoretical results for duration models and their
application on the strike data of Example 6.9.

a. Show that for finished duration data the survival values $S_i(y_i)$ as defined in
Section 6.3.4 are uniformly distributed.

b. Show that the corresponding generalized residuals $e_i = -\log(S_i(y_i))$ have an
exponential distribution with density $e^{-t}$ and with $k$th moment
$k! = k \cdot (k-1) \cdots 2 \cdot 1$.

c. Show that, if the sample contains no censored durations, ML in (6.45) always gives a
sample mean of the generalized residuals of

$$\frac{1}{n} \sum_{i=1}^{n} e_i = 1.$$

d.  For  the  censored  duration  data  in  Example  6.9 
(with  censoring  from  above  at  eighty  days), 
compute  the  generalized  residuals  of  the  model 
with  exponential  hazard  rate  both  for  the  case 
without  and  for  the  case  with  explanatory 
variable. 

e.  Make  plots  of  the  sample  cumulative  distribution 
functions  of  the  generalized  residuals  for  the  two 
models  of  d,  and  compute  the  first  three  sample 
moments  of  the  generalized  residuals.  What  is 
your  conclusion? 

f. For the log-normal model $\log(y_i) = \alpha + \beta x_i + \varepsilon_i$ estimated in
Example 6.9, determine the hazard rates after $t = 10$ days for the values $x_i = -0.10$,
$x_i = 0$, and $x_i = 0.07$ of the production index. At what time instant does the hazard
rate reach its maximum when $x_i = 0$ (it may be helpful to plot $\lambda(t)$ as a function
of $t$ to determine the location of the maximum graphically)?


Time Series and Dynamic Models


This chapter treats the modelling of variables that are observed sequentially
over time. The main focus is on univariate time series models for a single
economic variable, but we also discuss regression models with lags and
multivariate time series models. Time series analysis consists of several
phases. First, the dynamic structure of the model is selected and then the
parameters of the model are estimated. Diagnostic tests are performed to test
the model assumptions, and the outcomes may suggest alternative specifications
of the model. When an acceptable model has been obtained, it can be
used, for example, to forecast future values of the variables. Many economic
time series contain trends that are of major importance in forecasting. The
series may also exhibit seasonal variation or a variance that changes over
time. Therefore we pay special attention to the modelling of trends, seasonals,
and the variance of economic time series.

Sections 7.1-7.3 are the basic sections of this chapter that are required
for the material discussed in Sections 7.4-7.7. Sections 7.4 and 7.5 can be
read independently from each other; Section 7.5 is required for Sections 7.6
and 7.7.

7.1 Models for stationary time series

Uses  Chapters  1-4;  Section  5.5. 


7.1.1  Introduction 

To  get  an  idea  of  economic  time  series  we  consider  two  series  that  are  used  as 
leading  examples  in  this  chapter. 


Example 7.1: Industrial Production

In  this  section  and  in  following  sections  we  will  consider  a  time  series  that 
measures  the  quarterly  industrial  production  in  the  USA.  The  data  are  taken 
from  the  OECD  main  economic  indicators.  We  will  discuss  (i)  the  data  and 
(ii)  some  useful  transformations  of  the  data. 

(i)  The  data 

Exhibit 7.1 (a) shows the quarterly index of total industrial production in the
USA. In time series plots, the horizontal axis always measures time (here
the years and quarters) and the vertical axis measures the values of the time
series. If we followed the convention of scatter diagrams, the horizontal
axis would be labelled ‘time’ and the vertical axis the ‘observed series’.
However, in time series plots one usually places the name of the observed
variable on the horizontal axis as this is easier to read; note that the
values of this variable are still measured on the vertical axis. The series is indexed
so that the average of the four quarterly values of industrial production
over 1992 is equal to 100. The sample period runs from 1950.1 to 1998.3.
In our analysis of this series, the data prior to 1961.1 are used as
starting values and the data from 1995.1 to 1998.3 are left out to evaluate
the out-of-sample forecasting performance of proposed models. Therefore,
in modelling this series the effective sample ranges from 1961.1 to 1994.4
and contains $n = 136$ observations. We denote the industrial production
index by $x_t$. Here we follow the convention in the time series literature
to use the index $t$ (instead of $i$), because the observations are naturally
ordered with time.

(ii) Transformations of the data

The series $x_t$ shows exponential growth over time. Therefore we will consider
models for the logarithm of this series. We denote the resulting series by
$y_t = \log(x_t)$. The series $y_t$ is shown in Exhibit 7.1 (b). It contains a clear
upward trend and some fluctuations that may be due to seasonal effects.
Models for the trend of this series are described in Examples 7.13 and 7.14
and the seasonal effects are discussed in Example 7.16. Exhibit 7.1 (c) shows
the quarterly series of annual growth rates, defined by

$$\Delta_4 y_t = y_t - y_{t-4} = \log\left(\frac{x_t}{x_{t-4}}\right) = \log\left(1 + \frac{x_t - x_{t-4}}{x_{t-4}}\right).$$

This series contains no trend anymore and moves with gradual upward
and downward fluctuations around a long-term mean. Such fluctuations
correspond to a business cycle with negative growth rates in the recession
periods, for example around the periods 1974-5 and 1980-2. This series of
quarterly growth rates will be further analysed in later examples in Sections
7.2.2-7.2.4.

Exhibit 7.1 Industrial Production (Example 7.1)

Quarterly series of US industrial production (X in (a)), in logarithms (Y = log(X) in (b)), and
the corresponding yearly growth rates (D4Y = Y - Y(-4) in (c)).
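
The log transformation and the four-quarter difference used in Example 7.1 can be sketched in a few lines of Python. This is only an illustration: the index values below are hypothetical (they are not the OECD data), and the function name `log_seasonal_difference` is an assumption introduced here.

```python
import numpy as np

def log_seasonal_difference(x, lag=4):
    """Return y_t - y_{t-lag} with y = log(x), for all t where both terms exist."""
    y = np.log(np.asarray(x, dtype=float))
    return y[lag:] - y[:-lag]

# Hypothetical quarterly index values (two years), not the data of Example 7.1.
x = np.array([100.0, 101.0, 102.5, 103.0, 104.0, 105.5, 106.0, 107.5])

growth = log_seasonal_difference(x, lag=4)   # annual growth rates in logs
exact = (x[4:] - x[:-4]) / x[:-4]            # relative change over four quarters

print(growth)
print(exact)
```

For small growth rates the two series are close, because log(1 + r) is approximately r when r is small; the log difference is never larger than the relative change.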

Example  7.2:  Dow-Jones  Index 

As  a  second  example  we  consider  the  Dow-Jones  Industrial  Average.  The 
data  are  taken  from  the  Internet  database  ‘Economagic’.  We  will  discuss 
(i)  the  data  and  (ii)  some  useful  transformations  of  the  data. 

(i)  The  data 

The observed series consists of the daily close of the Dow-Jones index over
the period 1990 (2 January) to 1999 (31 December). The series (denoted by
$DJ_t$) contains $n = 2528$ observations and is shown in Exhibit 7.2 (a). The
days are numbered consecutively so that closing days (weekends and holidays)
are not taken into account. Exhibit 7.2 (d) shows the years that
correspond to the observation numbers.

Exhibit 7.2 Dow-Jones Index (Example 7.2)

Dow-Jones Industrial Average (DJ in (a)), series of logarithms of the Dow-Jones (LOGDJ in
(b)), series of daily returns (DLOGDJ, the series of first differences of LOGDJ in (c)) and years
((d); the observation number is measured on the horizontal axis).

(ii)  Transformations  of  the  data 

The Dow-Jones index shows an exponential trend with fluctuations that
become more pronounced for higher levels of the index. Exhibit 7.2 (b)
shows the logarithm of the index (which we denote by $y_t = \log(DJ_t)$). The
fluctuations around the trend are now more stable. Exhibit 7.2 (c) shows the
series of daily returns of the index, defined by

$$\Delta y_t = y_t - y_{t-1} = \log\left(1 + \frac{DJ_t - DJ_{t-1}}{DJ_{t-1}}\right) \approx \frac{DJ_t - DJ_{t-1}}{DJ_{t-1}}.$$

The  variance  of  the  series  of  daily  returns  changes  over  time,  with  volatile 
periods  followed  by  periods  with  smaller  fluctuations.  The  nature  of  the 
trend  in  the  Dow-Jones  index  is  analysed  in  Section  7.3.3,  and  models  for 
changes  in  the  variance  are  discussed  in  Section  7.4.4. 

Structure of Sections 7.1-7.4

The  actual  modelling  of  univariate  time  series  like  the  ones  in  the  above  two 
examples  is  described  in  Sections  7.2  (for  stationary  series),  7.3  (for  time 
series  with  trends  and  seasonals),  and  7.4  (for  time  series  with  non-linear 
aspects).  This  requires  some  basic  models  and  tools  in  time  series  analysis, 
which  are  now  discussed. 


7.1.2  Stationary  processes 

Time  series 

When a variable is observed sequentially over time, the observations constitute
a time series. Such a time series consists of a set of realized values of
the relevant business or economic process that evolves over time. The frequency
of observation can, for example, be annual, quarterly (as in Example
7.1), monthly, or daily (as in Example 7.2). Time series data are often
strongly correlated over time. For example, if the current industrial production
index has a value of 100, then it is more likely that the next quarter
this index will be somewhere between 90 and 110 than that it will be as
high as, say, 200.

Stationarity

A  time  series  is  called  stationary  if  its  statistical  properties  remain  constant 
over  time.  This  means  that,  when  we  consider  two  different  time  intervals, 
the  sample  mean  and  sample  covariances  of  the  time  series  over  the  two  time 
intervals will be roughly the same. More precisely, a time series $y_t$ is called
(second order) stationary if the following conditions are satisfied:

$$E[y_t] = \mu, \quad E[(y_t - \mu)^2] = \gamma_0, \quad E[(y_t - \mu)(y_{t-k} - \mu)] = \gamma_k \quad \text{(for all } t\text{)}.$$

Here $\mu$, $\gamma_0$, and $\gamma_k$ are finite-valued numbers that do not depend on time $t$. So
the  mean  has  to  be  constant  over  time,  and,  if  the  series  has  a  trend,  this 
should  be  removed  (see  Section  7.3).  Also  the  variance  has  to  be  constant, 
and,  if  the  series  contains  seasonal  fluctuations  or  changing  variance,  this 
should  also  be  removed  (see  Sections  7.3.4  and  7.4).  Finally,  the  covariances 
are  constant  over  time  —  for  instance,  the  covariance  between  the  industrial 
production  in  two  consecutive  quarters  is  the  same  for  all  quarters  and  over 
all  years. 

Autocorrelations  of  a  stationary  process 

The autocorrelations of a stationary process are defined by

$$\rho_k = \frac{\gamma_k}{\gamma_0}.$$

These correlations describe the short-run dynamic relations within the time
series, in contrast with the trend, which corresponds to the long-run behaviour
of the time series. A time series model summarizes the correlations
between $y_t$ and the past values $y_{t-k}$, $k \geq 1$, in terms of a limited number of
parameters. This differs from the models discussed in the foregoing chapters,
where the outcomes of the dependent variable were explained in terms of
other, independent variables. For instance, in the regression model the explained
part of $y_t$ is given by $x_t'b$, where $b = (X'X)^{-1}X'y$, which involves the
correlations between the dependent and the independent variables. In a
univariate time series model, it is the correlation with lagged values of the
explained variable that is of interest.
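
The autocorrelations above are population quantities. A minimal sketch of a sample analogue is given below; it assumes the common estimator that replaces the mean and the covariances by sample averages with divisor n. That choice, and the function name `sample_autocorrelations`, are assumptions made for illustration and are not taken from this section.

```python
import numpy as np

def sample_autocorrelations(y, max_lag):
    """Sample autocorrelations r_0, ..., r_max_lag of the series y.

    Uses c_k = (1/n) * sum_{t=k+1}^{n} (y_t - ybar)(y_{t-k} - ybar) and r_k = c_k / c_0.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    d = y - y.mean()                 # deviations from the sample mean
    c0 = np.dot(d, d) / n            # sample variance (lag 0)
    r = np.empty(max_lag + 1)
    for k in range(max_lag + 1):
        r[k] = (np.dot(d[k:], d[:n - k]) / n) / c0
    return r

# Small artificial series with a trend, so neighbouring values are similar.
r = sample_autocorrelations([1.0, 2.0, 3.0, 4.0, 5.0], max_lag=2)
print(r)
```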

Time  series  prediction  and  the  innovation  process 

To describe the correlations, we imagine that our observed time series comes
from a stationary process that existed before we started observing it. For
instance, in Examples 7.1 and 7.2 we use data on US industrial production
from 1961 to 1994 and on the Dow-Jones from 1990 to 1999, but both
processes existed before the start of our observations. We denote the past of
the stationary process $y_t$ by $Y_{t-1} = \{y_{t-1}, y_{t-2}, \ldots\}$, where the ‘dots’ mean
that there is no clear-cut beginning of this past. Here $Y_{t-1}$ is called the
information set available at time point $(t-1)$. The least squares predictor
of $y_t$ based on the past $Y_{t-1}$ is the function $f(Y_{t-1})$ that minimizes
$E[(y_t - f(Y_{t-1}))^2]$. This predictor is given by the conditional mean
$f(Y_{t-1}) = E[y_t | Y_{t-1}]$ with corresponding (one-step-ahead) prediction errors

$$\varepsilon_t = y_t - f(Y_{t-1}) = y_t - E[y_t | Y_{t-1}]. \qquad (7.1)$$

The process $\varepsilon_t$ is also called the innovation process, as it corresponds to the
unpredictable movements in $y_t$. If the observations are jointly normally
distributed, then the conditional mean is a linear function of the past
observations, say,

$$E[y_t | Y_{t-1}] = \alpha + \pi_1 y_{t-1} + \pi_2 y_{t-2} + \cdots.$$

Here $\alpha$ is included to model the mean $E[y_t] = \mu$ of the series. From the above
equation we get $\mu = \alpha + \mu \sum_k \pi_k$, so that $\mu = (1 - \sum_k \pi_k)^{-1}\alpha$. As the process is
assumed to be stationary, the coefficients $\pi_k$ do not depend on time and the
innovation process $\varepsilon_t$ is also stationary. It has the following properties (see
Exercise 7.1):

$$E[\varepsilon_t] = 0 \text{ for all } t,$$

$$E[\varepsilon_t^2] = \sigma^2 \text{ for all } t,$$

$$E[\varepsilon_s \varepsilon_t] = 0 \text{ for all } s \neq t.$$

Here the variance $\sigma^2$ is constant over time. Such a process, with all
autocorrelations equal to zero, is called white noise. It has all the properties (zero
mean, homoskedastic, uncorrelated) of the disturbance term in the standard
regression model.

Autoregressive  model  for  stationary  time  series 

We can rewrite (7.1) as

$$y_t = \alpha + \pi_1 y_{t-1} + \pi_2 y_{t-2} + \cdots + \varepsilon_t. \qquad (7.2)$$

This can be interpreted as a regression model with disturbance terms that
satisfy the standard assumptions of the regression model, that is, Assumptions
2, 3, and 4 of Section 3.1.4 (p. 125). The above model is called an
autoregressive model. The regressors consist of time lags of the dependent
variable and are therefore stochastic. The regressors in this model are
exogenous, because, for all $k \geq 1$, $y_{t-k}$ belongs to the information set $Y_{t-1}$,
so that

$$
\begin{aligned}
\mathrm{cov}(y_{t-k}, \varepsilon_t) &= E[(y_{t-k} - \mu)(y_t - E[y_t | Y_{t-1}])] \\
&= E[(y_{t-k} - \mu) y_t] - E[E[(y_{t-k} - \mu) y_t | Y_{t-1}]] \\
&= E[(y_{t-k} - \mu) y_t] - E[(y_{t-k} - \mu) y_t] = 0.
\end{aligned}
$$


General  idea  of  estimation  of  time  series  models 

The model (7.2) has many parameters, in principle as many as the number of
periods since the beginning of the process. In practice the data generating
process is unknown and the data consist of observations $y_t$ on a limited time
interval $t = 1, \ldots, n$. To estimate a time series model, the model (7.2) should
be approximated by models with fewer parameters. That is, the unknown
optimal prediction function $f(Y_{t-1})$ has to be approximated by a model of
our choice. We denote the model by $f(Y_{t-1}, \theta)$, where $f$ is a specified function
containing a limited number of unknown parameters $\theta$. In Sections 7.1.3 and
7.1.4 we discuss models that are often used for this purpose, namely,
ARMA models. The parameters $\theta$ of the model can be estimated, for instance,
by minimizing the sum of squared prediction errors

$$S(\theta) = \sum_{t=1}^{n} \left(y_t - f(Y_{t-1}, \theta)\right)^2.$$

If the model is properly specified in the sense that $f(Y_{t-1}, \theta) \approx E[y_t | Y_{t-1}]$,
then the prediction errors will be close to the innovations $\varepsilon_t$.
This can be used as a basis for diagnostic testing, by testing whether the
model residuals are uncorrelated and have constant variance. Estimation,
diagnostic tests, and model selection are discussed in Section 7.2.
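
As an illustration of minimizing the sum of squared prediction errors, the sketch below takes the simplest case in which the prediction function is assumed to be linear in one lag, f(Y_{t-1}, θ) = α + φ y_{t-1} with θ = (α, φ). Because the value before the first observation is not available, the sum here starts at the second observation; this conditioning on y_1, the simulated data, and the function name `fit_ar1_least_squares` are all assumptions of this sketch.

```python
import numpy as np

def fit_ar1_least_squares(y):
    """Minimize sum_{t=2}^{n} (y_t - alpha - phi * y_{t-1})^2 over (alpha, phi)."""
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y) - 1), y[:-1]])  # constant and first lag
    target = y[1:]
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    residuals = target - X @ coef                        # estimated innovations
    return coef[0], coef[1], residuals

# Simulated data (hypothetical parameter values, standard normal innovations).
rng = np.random.default_rng(0)
n, alpha_true, phi_true = 500, 1.0, 0.9
y = np.empty(n)
y[0] = alpha_true / (1.0 - phi_true)   # start at the mean of the process
for t in range(1, n):
    y[t] = alpha_true + phi_true * y[t - 1] + rng.standard_normal()

alpha_hat, phi_hat, resid = fit_ar1_least_squares(y)
print(alpha_hat, phi_hat, resid.var())
```

The residuals returned by the function are the prediction errors of the fitted model, so they are the natural input for checks on autocorrelation and constant variance.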

Exercises:  T:  7.1a. 


7.1.3  Autoregressive  models 

Autoregressive  process  of  order  p:  AR(p) 

A simple model for a time series $y_t$ is obtained by choosing a specific finite
length of the autoregression (7.2). If $p$ past values are included in the
regression, this gives the model

$$y_t = \alpha + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \varepsilon_t, \quad t = p+1, p+2, \ldots, n. \qquad (7.3)$$

Here $\alpha$ and $\phi_1$ to $\phi_p$ are unknown parameters. The process $\varepsilon_t$ is white noise
with the property that $E[\varepsilon_t y_{t-k}] = 0$ for all $k \geq 1$. So the regressors $y_{t-k}$ in
(7.3) are exogenous, $k = 1, \ldots, p$. As the time series $y_t$ is observed for
$t = 1, \ldots, n$, the lagged explanatory variable $y_{t-p}$ is available only from
time $t = p + 1$ onwards. This model is called an autoregressive model of
order $p$, also written as AR($p$).

The  lag  operator 

For ease of notation one uses the lag operator $L$ defined by

$$L y_t = y_{t-1}.$$

Repetitive application of this operator gives $L^k y_t = y_{t-k}$. The AR($p$) model
can then be written in a more concise form as

$$\phi(L) y_t = \alpha + \varepsilon_t, \quad \phi(L) = 1 - \phi_1 L - \cdots - \phi_p L^p. \qquad (7.4)$$

Stationarity  condition 

The statistical properties of the process (7.4) are determined by the values of
the parameters $\phi_1, \ldots, \phi_p$. For instance, the condition for stationarity can be
expressed in terms of the roots of the polynomial $\phi(z)$ by factorizing this
polynomial in terms of its $p$ (possibly complex valued) roots $z = 1/\alpha_k$ as

$$\phi(z) = (1 - \alpha_1 z)(1 - \alpha_2 z) \cdots (1 - \alpha_p z). \qquad (7.5)$$

The process is stationary if and only if $|\alpha_k| < 1$ for all $k = 1, \ldots, p$, that is,
all the solutions of $\phi(z) = 0$ should lie outside the unit circle in the complex
plane. Here we will clarify this in more detail for the case of an AR(1) model;
the general AR($p$) model is left as an exercise (see Exercise 7.1).
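
The root condition can be checked numerically. The sketch below uses `numpy.roots` to find the solutions of φ(z) = 0 and tests whether all of them lie outside the unit circle; the function names `ar_roots` and `is_stationary` are hypothetical and chosen only for this illustration.

```python
import numpy as np

def ar_roots(phi):
    """Roots of phi(z) = 1 - phi_1 z - ... - phi_p z^p."""
    phi = np.asarray(phi, dtype=float)
    # np.roots expects coefficients ordered from the highest power down to the constant.
    coeffs = np.concatenate([-phi[::-1], [1.0]])
    return np.roots(coeffs)

def is_stationary(phi):
    """True if all roots of phi(z) lie strictly outside the unit circle."""
    return bool(np.all(np.abs(ar_roots(phi)) > 1.0))

print(ar_roots([0.9]), is_stationary([0.9]))              # AR(1) with phi = 0.9
print(ar_roots([1.5, -0.6]), is_stationary([1.5, -0.6]))  # AR(2) with complex roots
print(is_stationary([1.0]))                               # unit root
```

A root exactly on the unit circle, as in the last call, fails the strict inequality and is reported as non-stationary.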


Derivation  of  stationarity  condition  for  an  AR(1)  process 

We  consider  the  first  order  autoregressive  model 


$$y_t = \phi y_{t-1} + \varepsilon_t, \quad t = 2, \ldots, n. \qquad (7.6)$$

Here $\varepsilon_t$ is the innovation process and for simplicity of notation we write $\phi$ for the
parameter $\phi_1$ and we assume that $\alpha = 0$. By recursive substitution of the lagged
values of $y_t$ this can be rewritten as

$$y_t = \phi^{t-1} y_1 + \sum_{j=0}^{t-2} \phi^j \varepsilon_{t-j}, \quad t = 2, \ldots, n. \qquad (7.7)$$

An innovation at time $t - j$ therefore affects the value of $y_t$ with multiplier
$\phi^j$. If $|\phi| > 1$, then the impact of the innovations grows over time and the
time series displays explosive behaviour, whereas for $|\phi| < 1$ the impact dies out
over time. We will now show that the AR(1) process is stationary if and only if
$|\phi| < 1$.

First we suppose that the process $y_t$ is stationary with mean $\mu$ and variance $\gamma_0$,
and we will prove that $|\phi| < 1$. Recall that $\varepsilon_t$ has mean zero and variance $\sigma^2$, that
$\varepsilon_t$ is uncorrelated with $y_{t-1}$, and that $\gamma_0 = E[(y_t - \mu)^2] = E[y_t^2] - \mu^2$. It then
follows from (7.6) and (7.7) that

$$\mu = E[y_t] = \phi^{t-1} \mu,$$

$$\gamma_0 + \mu^2 = E[y_t^2] = \phi^2 E[y_{t-1}^2] + \sigma^2 = \phi^2(\gamma_0 + \mu^2) + \sigma^2.$$

The first equality implies that $\mu = 0$ or $\phi = 1$, but in the last case the second
equality has no finite solution for $\gamma_0$ (because $\sigma^2 > 0$). So we conclude that $\phi \neq 1$
and $\mu = 0$. Then the second equality becomes $\gamma_0 = \phi^2 \gamma_0 + \sigma^2$ or $\phi^2 = (\gamma_0 - \sigma^2)/\gamma_0$,
so that $|\phi| < 1$. This shows that for a stationary process $|\phi| < 1$.

Now we prove the reverse, that is, if $|\phi| < 1$, then (7.6) has a solution process
$y_t$ that is stationary. We prove this by constructing the process $y_t$. Let $y_1$ be a
random variable with mean zero and variance $\sigma^2/(1 - \phi^2)$ and let $\varepsilon_t$ be IID$(0, \sigma^2)$
for $t \geq 2$ and independent from $y_1$. Further let $y_t$ for $t \geq 2$ be defined by (7.7).
Then it follows that $E[y_t] = 0$ for all $t \geq 1$, so that the mean $\mu = 0$ is constant over
time. It remains to prove that the variance and covariances of this process are
constant over time. Using the fact that $E[\varepsilon_s \varepsilon_t] = 0$ for all $s \neq t$, that $E[y_1 \varepsilon_t] = 0$ for
all $t \geq 2$, and that for $|\phi| < 1$ there holds $\sum_{h=0}^{\infty} \phi^{2h} = 1/(1 - \phi^2)$, we obtain from
(7.7) that for $0 \leq k \leq t - 1$

$$
\begin{aligned}
E[y_t y_{t-k}] &= E\left[\left(\phi^{t-1} y_1 + \sum_{j=0}^{t-2} \phi^j \varepsilon_{t-j}\right)\left(\phi^{t-k-1} y_1 + \sum_{h=0}^{t-k-2} \phi^h \varepsilon_{t-k-h}\right)\right] \\
&= \phi^{2t-k-2}\,\mathrm{var}(y_1) + \sigma^2 \sum_{h=0}^{t-k-2} \phi^{k+2h} \\
&= \sigma^2 \frac{\phi^{2t-k-2}}{1 - \phi^2} + \sigma^2 \phi^k \sum_{h=0}^{t-k-2} \phi^{2h} \\
&= \sigma^2 \phi^k \left(\sum_{h=0}^{t-k-2} \phi^{2h} + \sum_{h=t-k-1}^{\infty} \phi^{2h}\right) = \sigma^2 \frac{\phi^k}{1 - \phi^2}.
\end{aligned}
$$

This shows that the variance of $y_t$ (obtained for $k = 0$) is constant over time and
that the covariance between $y_t$ and $y_{t-k}$ does not depend on time $t$. This means
that the AR(1) process is stationary for $|\phi| < 1$, which concludes the proof.

Variance and autocorrelations of an AR(1) process

The above derivation shows that the variance of a stationary AR(1) process
(with $|\phi| < 1$) is equal to

$$\gamma_0 = \frac{\sigma^2}{1 - \phi^2}.$$

The autocorrelations are given by

$$\rho_k = \frac{\gamma_k}{\gamma_0} = \phi^k.$$

The correlations tend exponentially to zero for $k \to \infty$ with a speed that
depends on $\phi$. If this coefficient is very close to one, then the correlations die
out only very slowly. For $\phi = 1$ the process $y_t$ is no longer stationary: it does
not have a finite variance, and it has trending behaviour, as will be further
discussed in Section 7.3.
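
The closed-form covariances can be checked against the two terms that appear in the derivation of the stationarity condition: the contribution of the starting value y_1 and the finite sum over the innovations. The sketch below does this arithmetic for a few values of t and k; the parameter values and the function names are assumptions chosen only for the check.

```python
import math

def ar1_cov_from_sums(phi, sigma2, t, k):
    """E[y_t y_{t-k}] for 0 <= k <= t-1, built from the two terms of the derivation."""
    var_y1 = sigma2 / (1.0 - phi**2)
    start_term = phi**(2 * t - k - 2) * var_y1
    # h runs over 0, ..., t-k-2 (an empty sum when k = t-1).
    innovation_term = sigma2 * sum(phi**(k + 2 * h) for h in range(t - k - 1))
    return start_term + innovation_term

def ar1_cov_closed_form(phi, sigma2, k):
    """gamma_k = sigma^2 * phi^k / (1 - phi^2)."""
    return sigma2 * phi**k / (1.0 - phi**2)

phi, sigma2 = 0.9, 2.0   # hypothetical values with |phi| < 1
for t in (2, 5, 50):
    for k in range(t):
        lhs = ar1_cov_from_sums(phi, sigma2, t, k)
        rhs = ar1_cov_closed_form(phi, sigma2, k)
        assert math.isclose(lhs, rhs, rel_tol=1e-9)

# Autocorrelations follow as gamma_k / gamma_0 = phi^k.
print([ar1_cov_closed_form(phi, sigma2, k) / ar1_cov_closed_form(phi, sigma2, 0)
       for k in range(4)])
```

The covariance does not change with t, which is the content of the stationarity result.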


Mean  of  an  AR(p)  process 

The constant term $\alpha$ in (7.3) is included to allow for a non-zero mean
of the time series. Note, however, that this parameter is not equal to
the mean $\mu = E[y_t]$ of the process. By taking expected values in (7.3) it
follows that $(1 - \sum_{k=1}^{p} \phi_k)\mu = \alpha$ or $\mu = \alpha/(1 - \sum_{k=1}^{p} \phi_k)$. This can also be
written as

$$\mu = E[y_t] = \alpha/\phi(1),$$

where $\phi(1) = 1 - \sum_{k=1}^{p} \phi_k$ is the value obtained by evaluating the polynomial
$\phi(z)$ of (7.4) at $z = 1$. If $\phi(1) = 0$, that is, if the polynomial $\phi(z)$ of the
AR($p$) model has a root at $z = 1$ (called a unit root), then the mean of the
process is not defined. This is in line with the fact that the process is not
stationary in this case.
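
A short numerical sketch of the relation between the constant term and the mean is given below. The function name `ar_mean` and the parameter values are hypothetical; the tolerance used to detect a unit root is likewise an assumption of this illustration.

```python
import math

def ar_mean(alpha, phi):
    """Mean alpha / phi(1) of an AR(p) process, with phi(1) = 1 - sum of the phi_k."""
    phi_at_one = 1.0 - sum(phi)
    if math.isclose(phi_at_one, 0.0, abs_tol=1e-12):
        raise ValueError("phi(1) = 0: unit root, the mean is not defined")
    return alpha / phi_at_one

print(ar_mean(1.0, [0.9]))         # AR(1): constant 1 and phi = 0.9
print(ar_mean(2.0, [1.5, -0.6]))   # AR(2): phi(1) = 1 - 1.5 + 0.6
```

With a coefficient of 0.9, a constant of 1 corresponds to a mean about ten times as large, which shows how far the constant and the mean can lie apart when φ(1) is close to zero.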

Example  7.3:  Simulated  AR  Time  Series 

As  an  illustration  we  consider  three  simulated  time  series.  The  series  are 
generated  respectively  by  the  white  noise  process  yt  =  ef,  by  the  stationary 
AR(1)  process  yt  =  0.9yt-i  +  st,  and  by  the  stationary  AR(2)  process 
yt  =  1.5yt-\  —  0 .6yt-2  +  £«•  Exhibit  7.3  shows  time  plots  of  the  three  simu¬ 
lated  time  series.  The  white  noise  process  is  uncorrelated,  whereas  the 
AR(1)  process  is  strongly  correlated  over  time  with  pk  =  (0.9)*,  k>0. 


E 
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(a)  ( b ) 


Exhibit  7.3  Simulated  AR  Time  Series  (Example  7.3) 

Simulated  time  series,  white  noise  (a),  AR(1)  process  (b),  and  AR(2)  process  (c). 


The AR(2) process shows more or less steady oscillations. This is related to
the fact that the corresponding polynomial $\phi(z) = 1 - 1.5z + 0.6z^2$ has complex
roots $z = 1.25 \pm 0.32i$ (where $i$ is the complex number defined by
$i = \sqrt{-1}$).
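
Series of this kind and the roots of the AR(2) polynomial can be reproduced along the following lines. This is a sketch under stated assumptions: standard normal innovations, a burn-in period of 200 observations, and a fixed seed are choices made here for illustration, and `simulate_ar` is a hypothetical helper, so the resulting series will not coincide with those of Exhibit 7.3.

```python
import numpy as np

def simulate_ar(phi, n, burn=200, seed=0):
    """Simulate y_t = phi_1 y_{t-1} + ... + phi_p y_{t-p} + eps_t with eps_t ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n + burn)
    y = np.zeros(n + burn)
    for t in range(n + burn):
        for j in range(1, len(phi) + 1):
            if t - j >= 0:
                y[t] += phi[j - 1] * y[t - j]
        y[t] += eps[t]
    return y[burn:]                      # drop the burn-in so start-up effects fade out

white_noise = simulate_ar([], 500)       # no AR terms: y_t = eps_t
ar1 = simulate_ar([0.9], 500)
ar2 = simulate_ar([1.5, -0.6], 500)

# Roots of phi(z) = 1 - 1.5 z + 0.6 z^2; np.roots expects the highest power first.
roots = np.roots([0.6, -1.5, 1.0])
```

Both roots have modulus larger than one, which is the stationarity condition, and their non-zero imaginary parts correspond to the oscillating pattern of the AR(2) series.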

Exercises:  T:  7.1b. 

7.1.4  ARMA  models 

Moving  average  process  of  order  q:  MA(q) 

A process $y_t$ is called a moving average process if it can be described by

$$y_t = \alpha + \varepsilon_t + \theta_1\varepsilon_{t-1} + \cdots + \theta_q\varepsilon_{t-q}, \tag{7.8}$$

where $\varepsilon_t$ is white noise. This is called an MA($q$) process. This process
is always stationary, with mean $\mu = E[y_t] = \alpha$, variance $\gamma_0 =
\sigma^2(1 + \sum_{j=1}^{q} \theta_j^2)$, and covariances $\gamma_k = \sigma^2(\theta_k + \sum_{j=k+1}^{q} \theta_j\theta_{j-k})$ for $k \leq q$
and $\gamma_k = 0$ for $k > q$.
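
As a numerical illustration of these moments, the sketch below evaluates the autocovariances of an MA($q$) process by setting $\theta_0 = 1$, so that $\gamma_k = \sigma^2 \sum_{j=k}^{q} \theta_j\theta_{j-k}$. It assumes NumPy; the function name `ma_autocovariances` is hypothetical.

```python
import numpy as np

def ma_autocovariances(theta, sigma2=1.0):
    """Return [gamma_0, ..., gamma_q] for an MA(q) process with coefficients theta_1..theta_q."""
    c = np.concatenate(([1.0], np.asarray(theta, dtype=float)))   # prepend theta_0 = 1
    q = len(c) - 1
    # gamma_k = sigma^2 * sum_{j=k}^{q} c_j * c_{j-k}; for k > q the sum is empty (gamma_k = 0).
    return np.array([sigma2 * np.sum(c[k:] * c[:q + 1 - k]) for k in range(q + 1)])

gamma_ma1 = ma_autocovariances([0.9])        # MA(1) with theta = 0.9
gamma_ma2 = ma_autocovariances([0.9, 0.8])   # MA(2) with theta = (0.9, 0.8)
rho1_ma1 = gamma_ma1[1] / gamma_ma1[0]       # first autocorrelation theta / (1 + theta^2)
```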


Invertibility  condition 

In an MA model, the observed series $y_t$ is expressed in terms of current and
past values of the error terms $\varepsilon_t$. In a sense, this is the inverse of the
autoregressive model (7.2), where $\varepsilon_t$ is expressed in terms of current and
past values of $y_t$. If an MA model can be expressed as an (infinite order)
autoregressive model (7.2), then the MA model is called invertible. In this
case the error terms $\varepsilon_t$ in (7.8) are equal to the innovations (or one-step-ahead
prediction errors) $\varepsilon_t = y_t - E[y_t|Y_{t-1}]$, so that

$$E[y_t|Y_{t-1}] = \alpha + \theta_1\varepsilon_{t-1} + \cdots + \theta_q\varepsilon_{t-q}.$$

Invertibility requires some restrictions on the parameters in (7.8). If
the MA polynomial is factorized as $\theta(z) = (1 - \beta_1 z)(1 - \beta_2 z)\cdots(1 - \beta_q z)$,
then invertibility is equivalent to the condition that $|\beta_j| < 1$ for all
$j = 1, 2, \cdots, q$. Stated otherwise, all the solutions of $\theta(z) = 0$ should lie
outside the unit circle. Here we will explain this condition in more detail
for the MA(1) model; the general case (with $q \geq 2$) is left as an exercise (see
Exercise 7.1).


Derivation  of  invertibility  condition  for  MA(1)  process 

We consider the MA(1) model with mean $\alpha = 0$ described by

$$y_t = \varepsilon_t + \theta\varepsilon_{t-1}.$$

Invertibility requires that $\varepsilon_t$ can be written in terms of current and past values
of the observed process, that is, in terms of $y_{t-k}$ with $k \geq 0$. Now $\varepsilon_t =
y_t - \theta\varepsilon_{t-1}$, and similarly $\varepsilon_{t-1} = y_{t-1} - \theta\varepsilon_{t-2}$, and by further substitutions it follows
that

$$\varepsilon_t = y_t - \theta y_{t-1} + \theta^2 y_{t-2} - \cdots + (-\theta)^{t-2} y_2 + (-\theta)^{t-1}\varepsilon_1. \tag{7.9}$$

Assuming that the process was the same before the observations started (that is,
for $t \leq 0$), this substitution can be continued. Invertibility requires that, in the
limit, the error term on the right-hand side vanishes. This is the case if and only if
$-1 < \theta < 1$. Under this condition the process is therefore invertible.
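
The substitution argument can be illustrated by recovering the innovations of a simulated MA(1) series through the recursion $\varepsilon_t = y_t - \theta\varepsilon_{t-1}$, started from a guess for the unobserved initial error. The sketch assumes NumPy, standard normal innovations, and a starting guess of zero; `recover_innovations` is an illustrative name.

```python
import numpy as np

def recover_innovations(y, theta, eps0_guess=0.0):
    """Compute eps_t = y_t - theta * eps_{t-1} recursively from a guess for eps_0."""
    eps_hat = np.empty(len(y))
    prev = eps0_guess
    for t, y_t in enumerate(y):
        prev = y_t - theta * prev
        eps_hat[t] = prev
    return eps_hat

rng = np.random.default_rng(1)
eps = rng.standard_normal(201)           # eps[0] plays the role of the unobserved eps_0
theta = 0.9
y = eps[1:] + theta * eps[:-1]           # y_t = eps_t + theta * eps_{t-1}, t = 1, ..., 200

# Error made by starting the recursion at 0 instead of the true eps_0.
err = recover_innovations(y, theta) - eps[1:]
```

In this set-up the error at time $t$ equals $-(-\theta)^t\varepsilon_0$, so it dies out for $|\theta| < 1$ and would explode for $|\theta| > 1$.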
Autoregressive moving average process: ARMA(p, q)

Every stationary process can be approximated with any desired degree of
accuracy by means of autoregressive models (by taking the order $p$ of the
AR($p$) model large enough) and also by means of moving average models (by
taking the order $q$ of the MA($q$) model large enough). However, good
approximations may require very large orders and hence many parameters.
It may then be more convenient to describe the process $y_t$ by the ratio of two
polynomials of relatively low order as $y_t = \frac{\theta(L)}{\phi(L)}\varepsilon_t$. The resulting model is
written as

$$\phi(L)y_t = \alpha + \theta(L)\varepsilon_t, \tag{7.10}$$

with AR polynomial $\phi(L) = 1 - \phi_1 L - \cdots - \phi_p L^p$, MA polynomial $\theta(L) =
1 + \theta_1 L + \cdots + \theta_q L^q$, and where $\varepsilon_t$ is white noise. The above model is called
an autoregressive moving average model of order ($p$, $q$), also written as
ARMA($p$, $q$). The constant term $\alpha$ is included to allow for a non-zero mean
$\mu = E[y_t] = \alpha/\phi(1)$. An ARMA process is stationary if all the solutions of
$\phi(z) = 0$ lie outside the unit circle, just as in the case of AR processes. An
ARMA model is invertible if all the solutions of $\theta(z) = 0$ lie outside the unit
circle, just as in the case of MA processes.

In many cases, low order ARMA models provide an accurate approximation
of much higher order AR and MA models. That is, ARMA models need
relatively few parameters to describe the process, so that ARMA models are
parsimonious in this sense.
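
The two root conditions can be checked numerically. The following sketch assumes NumPy; the helper names are hypothetical, and the numerical root finder is only one possible way to verify the conditions.

```python
import numpy as np

def roots_outside_unit_circle(coefs):
    """Check the roots of 1 + c_1 z + ... + c_m z^m, given coefs = [c_1, ..., c_m]."""
    if len(coefs) == 0:
        return True                                   # constant polynomial: no roots
    poly = np.concatenate(([1.0], coefs))[::-1]       # np.roots wants the highest power first
    return bool(np.all(np.abs(np.roots(poly)) > 1.0))

def is_stationary(phi):
    # AR polynomial phi(z) = 1 - phi_1 z - ... - phi_p z^p
    return roots_outside_unit_circle(-np.asarray(phi, dtype=float))

def is_invertible(theta):
    # MA polynomial theta(z) = 1 + theta_1 z + ... + theta_q z^q
    return roots_outside_unit_circle(np.asarray(theta, dtype=float))

# Illustrative ARMA(1,1) with phi_1 = 0.9 and theta_1 = 0.8 (values chosen for this sketch).
arma11_ok = is_stationary([0.9]) and is_invertible([0.8])
```

A root exactly on the unit circle, such as $\phi_1 = 1$ in an AR(1) model, fails the check, in line with the unit root case discussed earlier.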


Example  7.4:  Simulated  MA  and  ARMA  Time  Series 

As an illustration we simulate three time series, namely from the MA(1)
model $y_t = \varepsilon_t + 0.9\varepsilon_{t-1}$, from the MA(2) model $y_t = \varepsilon_t + 0.9\varepsilon_{t-1} + 0.8\varepsilon_{t-2}$,
and from the ARMA(1,1) model $y_t = 0.9y_{t-1} + \varepsilon_t + 0.8\varepsilon_{t-1}$. All three
processes are stationary and invertible. Exhibit 7.4 contains graphs of the
three simulated series. Comparing the graphs in Exhibit 7.4 with those in
Exhibit 7.3, we see that it may not be easy to determine the appropriate
ARMA model from a time series plot. The next section discusses a tool to
select the orders of AR and MA models.


Exhibit 7.4 Simulated MA and ARMA Time Series (Example 7.4)

Simulated  time  series,  MA(1)  process  (a),  MA(2)  process  (b),  and  ARMA(1,1)  process  (c). 


Exercises:  T:  7.1c. 


7.1.5  Autocorrelations  and  partial  autocorrelations 

Autocorrelation  function 

As mentioned before, the correlations between successive values of a time
series are of key interest in forecasting the future movements of the series.
Therefore it is of interest to get insight into the correlations that are implied by
an ARMA model. The correlations have a direct meaning in terms of time
series properties, whereas the parameters in the ARMA model have only a
less direct interpretation. The autocorrelation function (ACF) of a time series
model is defined by the sequence of autocorrelations (for $k$ ranging from
$-\infty$ to $+\infty$)

$$\rho_k = \mathrm{corr}(y_t, y_{t-k}) = \frac{\gamma_k}{\gamma_0},$$

where $\gamma_k = E[(y_t - \mu)(y_{t-k} - \mu)]$. There holds $\gamma_k = \gamma_{-k}$ and hence also
$\rho_k = \rho_{-k}$, and $\rho_0 = 1$ always. So it suffices to consider the ACF only for
$k \geq 1$.


Derivation  of  autocorrelations  of  an  ARMA  process 

A white noise process $\varepsilon_t$ is characterized by the property that $\rho_k = 0$ for all $k \geq 1$.
Now we consider the ACF of an ARMA model (7.10) that is stationary and
invertible, so that the roots of the polynomials $\phi(z) = 0$ and $\theta(z) = 0$ all lie
outside the unit circle. The stationarity condition on $\phi(z)$ implies that $y_t$ can be
written as a linear function of $\varepsilon_{t-k}$, $k \geq 0$, say,

$$y_t = \mu + \varepsilon_t + \psi_1\varepsilon_{t-1} + \psi_2\varepsilon_{t-2} + \psi_3\varepsilon_{t-3} + \cdots. \tag{7.11}$$

This can be written as $y_t = \mu + \psi(L)\varepsilon_t$, where $\psi(z) = \sum_{k=0}^{\infty} \psi_k z^k$ with $\psi_0 = 1$. As $y_t$
also satisfies the ARMA model equation $\phi(L)y_t = \alpha + \theta(L)\varepsilon_t$, it follows that
$\alpha + \theta(L)\varepsilon_t = \phi(L)y_t = \phi(1)\mu + \phi(L)\psi(L)\varepsilon_t$, where we used the fact that
$\phi(L)\mu = (1 - \sum \phi_k L^k)\mu = (1 - \sum \phi_k)\mu = \phi(1)\mu$. It follows that $\mu = \alpha/\phi(1)$ and

$$\phi(z)\psi(z) = \theta(z).$$

So the coefficients $\psi_k$ in (7.11) can be obtained from the parameters
$(\phi_1, \cdots, \phi_p, \theta_1, \cdots, \theta_q)$ of the ARMA model by solving the equations
$\phi(z)\psi(z) = \theta(z)$ (for all powers $z^k$, $k \geq 0$). Once the values of $\psi_k$ are determined,
the ACF can be computed by

$$\gamma_k = E[(y_t - \mu)(y_{t-k} - \mu)] = E\left[\sum_{j=0}^{\infty} \psi_j\varepsilon_{t-j} \sum_{h=0}^{\infty} \psi_h\varepsilon_{t-k-h}\right] = \sigma^2 \sum_{h=0}^{\infty} \psi_{k+h}\psi_h.$$

Here we used the fact that $\varepsilon_t$ is white noise, so that $E[\varepsilon_{t-j}\varepsilon_{t-k-h}] = 0$ for all
$j \neq k + h$.


Autocorrelations of AR(1) and MA(1) process

As an illustration, for the AR(1) model (7.6) with $y_t = \phi y_{t-1} + \varepsilon_t$ we have
$\phi(z) = 1 - \phi z$ and $\theta(z) = 1$. Then the equation for $\psi(z)$ becomes
$(1 - \phi z)\psi(z) = 1$, so that $\psi(z) = 1/(1 - \phi z) = \sum_{k=0}^{\infty} \phi^k z^k$ and $\psi_k = \phi^k$. It then
follows that $\gamma_k = \sigma^2 \sum_{h=0}^{\infty} \phi^{k+h}\phi^h = \sigma^2\phi^k/(1 - \phi^2)$ and $\rho_k = \phi^k$. This agrees with earlier
results in Section 7.1.3 for the AR(1) process.

As another illustration, for the MA(1) model $y_t = \varepsilon_t + \theta\varepsilon_{t-1}$ we have
$\phi(z) = 1$ and $\theta(z) = 1 + \theta z$. Then the equation for $\psi(z)$ simply becomes
$\psi(z) = 1 + \theta z$. So the coefficients are directly obtained as $\psi_0 = 1$, $\psi_1 = \theta$,
and $\psi_k = 0$ for all $k \geq 2$. So $\gamma_0 = \sigma^2(1 + \theta^2)$, $\gamma_1 = \sigma^2\theta$, and $\gamma_k = 0$ for $k \geq 2$.
The autocorrelations are $\rho_1 = \theta/(1 + \theta^2)$ and $\rho_k = 0$ for all $k \geq 2$. This agrees
with earlier results in Section 7.1.4 for MA processes.

Further  cases  are  treated  in  the  exercises  (see  Exercise  7.2). 
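
The two-step procedure (first solve $\phi(z)\psi(z) = \theta(z)$ for the coefficients $\psi_k$, then sum products of these coefficients) can be sketched numerically as follows. The code assumes NumPy and truncates the infinite sum after a fixed number of terms, which is an approximation introduced here; the function names are illustrative.

```python
import numpy as np

def psi_weights(phi, theta, m):
    """Coefficients psi_0, ..., psi_m obtained from phi(z) psi(z) = theta(z)."""
    psi = np.zeros(m + 1)
    psi[0] = 1.0
    for k in range(1, m + 1):
        # Coefficient of z^k: psi_k - phi_1 psi_{k-1} - ... - phi_p psi_{k-p} = theta_k,
        # where theta_k = 0 for k > q.
        psi[k] = theta[k - 1] if k <= len(theta) else 0.0
        for j in range(1, min(k, len(phi)) + 1):
            psi[k] += phi[j - 1] * psi[k - j]
    return psi

def arma_acf(phi, theta, max_lag, m=2000):
    """Approximate rho_0, ..., rho_max_lag using gamma_k proportional to sum_h psi_{k+h} psi_h,
    with the infinite sum truncated after m terms."""
    psi = psi_weights(phi, theta, m + max_lag)
    gamma = np.array([np.sum(psi[k:k + m] * psi[:m]) for k in range(max_lag + 1)])
    return gamma / gamma[0]              # sigma^2 cancels in the ratio

rho_ar1 = arma_acf([0.9], [], 5)         # expected pattern: 0.9 ** k
rho_ma1 = arma_acf([], [0.9], 3)         # expected pattern: cut-off after lag 1
```

For a stationary model the coefficients $\psi_k$ decay geometrically, so the truncation error is negligible for the values used here; for roots close to the unit circle a larger truncation point would be needed.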


Characterization  of  MA  processes  in  terms  of  autocorrelations 

In the MA($q$) process (7.8) the terms $\varepsilon_t$ are uncorrelated. This implies that the
ACF has the property

$$\rho_k = 0 \text{ for all } k > q \quad (\text{MA}(q)).$$

That is, the ACF of an MA($q$) process cuts off after lag $q$. The reverse also
holds true, that is, if the ACF of a process has the property that $\rho_k = 0$ for
all $k > q$, then it can be written as an MA($q$) process (see Exercise 7.3).
Therefore the ACF can be used to select the order of an MA model. If the
ACF of a process is zero for $k > q$, then it can be described by an MA($q$)
model.

Characterization of AR processes in terms of partial autocorrelations

For AR processes the ACF decays to zero exponentially, but the order of the
model cannot easily be detected from the ACF. The order of AR models can
be selected by considering the so-called partial autocorrelation function
(PACF). The partial autocorrelation at lag $k$, denoted by $\phi_{kk}$, is defined as
the correlation between $y_t$ and $y_{t-k}$ that remains after the correlation due to
the intermediate values $y_{t-j}$ ($1 \leq j \leq k - 1$) has been removed. Formally, let
$y_t^* = y_t - E[y_t|y_{t-1}, \cdots, y_{t-k+1}]$ and $y_{t-k}^* = y_{t-k} - E[y_{t-k}|y_{t-1}, \cdots, y_{t-k+1}]$
be the 'residuals' of predicting $y_t$ and $y_{t-k}$ from the intermediate values
$\{y_{t-1}, \cdots, y_{t-k+1}\}$; then the partial autocorrelation $\phi_{kk} = \mathrm{corr}(y_t^*, y_{t-k}^*)$ is
the correlation between these two residuals. The two involved conditional
expectations are obtained by regressing $y_t$ and $y_{t-k}$ on the set of intermediate
values. It follows from the result of Frisch-Waugh in Section 3.2.5 (p. 146)
that $\phi_{kk}$ can also be obtained from the regression

$$y_t = \alpha + \phi_{k1}y_{t-1} + \phi_{k2}y_{t-2} + \cdots + \phi_{kk}y_{t-k} + \omega_t \tag{7.12}$$

(see Exercise 7.3), where $\omega_t$ denotes the error term of this regression. So $\phi_{kk}$ is
obtained by an autoregression of lag length $k$, that is, for each partial
autocorrelation we need a different regression. An
AR($p$) process is characterized by the following property (see Exercise 7.3).

$$\phi_{kk} = 0 \text{ for all } k > p \quad (\text{AR}(p)).$$

That is, the PACF of an AR($p$) process cuts off after lag $p$. The intuitive
explanation is that, for the AR($p$) process (7.3), $y_t$ is expressed in terms of
$y_{t-1}, \cdots, y_{t-p}$, so that additional lagged regressors $y_{t-k}$ with $k > p$ in (7.12)
have coefficient zero.




Sample  (partial)  autocorrelations 

In practice the (partial) autocorrelations are unknown and have to be estimated
from the observed time series. The so-called sample autocorrelation
function (SACF) is obtained by replacing $\rho_k$ by the sample correlations

$$r_k = \frac{\sum_{t=k+1}^{n} (y_t - \bar{y})(y_{t-k} - \bar{y})}{\sum_{t=1}^{n} (y_t - \bar{y})^2},$$

where $\bar{y} = \sum_{t=1}^{n} y_t/n$ is the sample mean of the series. The sample partial
autocorrelation function (SPACF) is obtained by replacing $\phi_{kk}$ by the estimated
coefficient $\hat{\phi}_{kk}$ of $y_{t-k}$ in the regression (7.12). Note that this involves
a different model for each coefficient, that is, $\hat{\phi}_{kk}$ is obtained by a regression
in an AR($k$) model.
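
A minimal sketch of both sample functions is given here, assuming NumPy. The SPACF is computed literally from the regressions (7.12) by OLS, one regression per lag, which may differ slightly from the method used by a particular software package. The simulated AR(1) series (5000 observations, fixed seed) is an assumption made for illustration.

```python
import numpy as np

def sample_acf(y, max_lag):
    """r_k = sum_{t=k+1}^n (y_t - ybar)(y_{t-k} - ybar) / sum_{t=1}^n (y_t - ybar)^2."""
    d = np.asarray(y, dtype=float) - np.mean(y)
    denom = np.sum(d ** 2)
    return np.array([np.sum(d[k:] * d[:len(d) - k]) / denom
                     for k in range(1, max_lag + 1)])

def sample_pacf(y, max_lag):
    """Coefficient of y_{t-k} in an OLS regression of y_t on a constant and k lags."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    out = []
    for k in range(1, max_lag + 1):
        # Columns: constant, y_{t-1}, ..., y_{t-k}; rows: t = k+1, ..., n.
        X = np.column_stack([np.ones(n - k)] + [y[k - j:n - j] for j in range(1, k + 1)])
        beta, *_ = np.linalg.lstsq(X, y[k:], rcond=None)
        out.append(beta[-1])             # last coefficient belongs to y_{t-k}
    return np.array(out)

# Illustrative AR(1) series with phi = 0.9.
rng = np.random.default_rng(42)
eps = rng.standard_normal(5000)
y = np.zeros(5000)
for t in range(1, 5000):
    y[t] = 0.9 * y[t - 1] + eps[t]

sacf = sample_acf(y, 5)      # decays slowly, roughly like 0.9 ** k
spacf = sample_pacf(y, 5)    # close to 0.9 at lag 1, small afterwards
```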

Example  7.5:  Simulated  Time  Series  (continued) 

As an illustration, we consider the six simulated time series that were generated
in Example 7.3 (see Exhibit 7.3) and Example 7.4 (see Exhibit 7.4).
Panels 1 and 2 of Exhibit 7.5 show the SACF and the SPACF of these six
series. For the white noise series, the theoretical ACF and PACF are both
zero, and the SACF and SPACF are relatively small. Statistical tests for the
significance of sample (partial) autocorrelations are discussed in Sections
7.2.3 and 7.2.4. For the two generated AR series, the SPACF is small for
lags $k > 1$ for the AR(1) process and for lags $k > 2$ for the AR(2) process. For
the two generated MA series, the SACF is small for lags $k > 1$ for the MA(1)
process and for lags $k > 2$ for the MA(2) process. For the ARMA(1,1) process
the SACF and SPACF decay relatively slowly. Panels 3 and 4 of Exhibit 7.5
contain two regressions to illustrate the calculation of the SPACF for the
AR(1) time series. The first sample partial autocorrelation $\hat{\phi}_{11} = 0.836$ is
obtained by regressing $y_t$ on a constant and $y_{t-1}$, and the second one
$\hat{\phi}_{22} = -0.047$ by regressing $y_t$ on a constant, $y_{t-1}$, and $y_{t-2}$. Clearly, the
variable $y_{t-2}$ in the last regression is not significant, as expected.


Panel 1

              WN                 AR(1)               AR(2)
Lag     SACF     SPACF      SACF     SPACF      SACF     SPACF
  1   -0.010    -0.010     0.833     0.833     0.907     0.907
  2    0.008     0.008     0.681    -0.045     0.713    -0.618
  3   -0.114    -0.114     0.534    -0.072     0.481    -0.049
  4   -0.074    -0.077     0.433     0.059     0.262     0.045
  5   -0.034    -0.035     0.365     0.043     0.080    -0.056
  6    0.038     0.026     0.312    -0.001    -0.060    -0.064
  7   -0.087    -0.105     0.247    -0.062    -0.158    -0.017
  8   -0.076    -0.096     0.215     0.076    -0.206     0.083
  9    0.018     0.018     0.213     0.092    -0.211    -0.023
 10   -0.001    -0.018     0.206    -0.022    -0.193    -0.081

Panel 2

             MA(1)               MA(2)             ARMA(1,1)
Lag     SACF     SPACF      SACF     SPACF      SACF     SPACF
  1    0.543     0.543     0.688     0.688     0.911     0.911
  2    0.050    -0.347     0.353    -0.229     0.742    -0.513
  3    0.002     0.248     0.014    -0.254     0.593     0.295
  4   -0.025    -0.231    -0.017     0.325     0.480    -0.116
  5   -0.037     0.162    -0.020    -0.118     0.401     0.126
  6    0.018    -0.069    -0.012    -0.138     0.336    -0.137
  7    0.009     0.012    -0.026     0.152     0.278     0.101
  8   -0.075    -0.117    -0.074    -0.160     0.244     0.068
  9   -0.080     0.064    -0.086    -0.017     0.233     0.023
 10   -0.050    -0.094    -0.103     0.039     0.228    -0.021

Panel 3: Dependent Variable: AR1
Method: Least Squares
Sample(adjusted): 2 500; Included observations: 499

Variable    Coefficient    Std. Error    t-Statistic    Prob.
C             -0.077447      0.044504      -1.740224    0.0824
AR1(-1)        0.835782      0.024710       33.82336    0.0000

Panel 4: Dependent Variable: AR1
Method: Least Squares
Sample(adjusted): 3 500; Included observations: 498

Variable    Coefficient    Std. Error    t-Statistic    Prob.
C             -0.080056      0.044741      -1.789344    0.0742
AR1(-1)        0.874953      0.044910       19.48236    0.0000
AR1(-2)       -0.046516      0.044931      -1.035273    0.3010

Exhibit 7.5 Simulated Time Series (Example 7.5)

First ten sample autocorrelations (SACF) and partial autocorrelations (SPACF) of time series of
length $n = 500$ simulated from a white noise process (WN), an AR(1) process, and an AR(2)
process (Panel 1) and from an MA(1) process, an MA(2) process, and an ARMA(1,1) process
(Panel 2). The regressions in Panels 3 and 4 illustrate the computation of the first SPACF (Panel
3, value 0.836) and second SPACF (Panel 4, value -0.047) of the series AR1 (the reported
numbers in Panel 1 are 0.833 and -0.045, as EViews uses a slightly different method to
compute the SPACF).

Exercises: T: 7.2b, c, 7.3a-c.


7.1.6  Forecasting 

One-step-ahead  and  multi-step-ahead  forecasts 

In this section we describe general methods to forecast future values of an
ARMA process by exploiting the correlation structure of the process. We
assume that the process is known, in the sense that all the parameters are
known and hence the correlations of the process are also known. We suppose
that the time series is observed on the time interval $t = 1, \cdots, n$ so that the
available information is given by $Y_n = \{y_1, \cdots, y_n\}$. The (least squares)
one-step-ahead forecast $\hat{y}_{n+1} = f(Y_n)$ is given by $f(Y_n) = E[y_{n+1}|Y_n]$, and the
$h$-step-ahead forecast is $\hat{y}_{n+h} = E[y_{n+h}|Y_n]$. We will restrict the attention to
linear forecasts, that is, to functions $f(Y_n)$ that are linear in the observations
$y_t$, $t = 1, \cdots, n$.


Forecasting  an  AR  process 

First we consider forecasting in the stationary AR($p$) model (7.3) with
$y_t = \alpha + \phi_1 y_{t-1} + \cdots + \phi_p y_{t-p} + \varepsilon_t$. The stationarity condition on the AR
polynomial implies that $y_t$ is a function of the past innovations ($\varepsilon_{t-k}$, $k \geq 0$),
as in (7.11). For $y_{n+1}$ this means that $\varepsilon_{n+1}$ is uncorrelated with all observations
in $Y_n$, so that the optimal one-step-ahead forecast is given by

$$\hat{y}_{n+1} = \alpha + \phi_1 y_n + \cdots + \phi_p y_{n+1-p}.$$

The corresponding forecast error is $y_{n+1} - \hat{y}_{n+1} = \varepsilon_{n+1}$ and the forecast error
variance is $\sigma^2$. In a similar way, the two-step-ahead forecast is given by

$$\hat{y}_{n+2} = \alpha + \phi_1\hat{y}_{n+1} + \phi_2 y_n + \cdots + \phi_p y_{n+2-p}.$$

The corresponding forecast error is $y_{n+2} - \hat{y}_{n+2} = \varepsilon_{n+2} + \phi_1(y_{n+1} - \hat{y}_{n+1})
= \varepsilon_{n+2} + \phi_1\varepsilon_{n+1}$ with variance $\sigma^2(1 + \phi_1^2)$. Forecasts for three and more
steps ahead can be constructed in a similar way (see Exercise 7.4).
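
The recursive construction of multi-step-ahead forecasts can be sketched as follows, where each new forecast is fed back as if it were an observation. The code assumes NumPy, known parameters, and an order $p \geq 1$; the function name `ar_forecast` and the numerical values are illustrative.

```python
import numpy as np

def ar_forecast(y, alpha, phi, horizon):
    """Forecasts 1, ..., horizon steps ahead for
    y_t = alpha + phi_1 y_{t-1} + ... + phi_p y_{t-p} + eps_t (p >= 1)."""
    phi = np.asarray(phi, dtype=float)
    p = len(phi)
    history = list(np.asarray(y, dtype=float)[-p:])     # last p observations, oldest first
    forecasts = []
    for _ in range(horizon):
        # Most recent value first, so that phi_1 multiplies the latest value
        # (an observation or, further ahead, an earlier forecast).
        nxt = alpha + float(np.dot(phi, history[::-1][:p]))
        forecasts.append(nxt)
        history.append(nxt)
    return np.array(forecasts)

# AR(1) with alpha = 1, phi = 0.5 and last observation y_n = 4: the mean is 2 and
# the forecasts move towards it geometrically.
f_ar1 = ar_forecast([4.0], 1.0, [0.5], 3)
# AR(2) with the coefficients of Example 7.3 and (y_{n-1}, y_n) = (1, 2).
f_ar2 = ar_forecast([1.0, 2.0], 0.0, [1.5, -0.6], 3)
```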
Forecasting an ARMA process

For a stationary and invertible ARMA model $\phi(L)y_t = \alpha + \theta(L)\varepsilon_t$, the forecasts
can in principle be computed from the moving average representation
(7.11). The first step is to compute the parameters $\psi_k$ of this representation by
solving the equations $\phi(z)\psi(z) = \theta(z)$ (where $\phi(z)$ and $\theta(z)$ are given). Then we can
write

$$y_t = \mu + \varepsilon_t + \psi_1\varepsilon_{t-1} + \psi_2\varepsilon_{t-2} + \cdots,$$

with $\mu = \alpha/\phi(1)$ known. As the process is invertible, the innovation process has
the property that $\varepsilon_t$ is a function of the observations $y_{t-k}$, $k \geq 0$. Assuming for the
moment that the process has been observed since the infinite past, this means that
the innovations $\varepsilon_t$ are known for all times $t \leq n$. It follows from the above moving
average representation that any linear $h$-step-ahead forecast $\hat{y}_{n+h}$ (that is, any
linear function of $y_t$, $t \leq n$) can be written in the form $\nu + \sum_{k=0}^{\infty} \nu_k\varepsilon_{n-k}$. As the
process $\varepsilon_t$ is uncorrelated, the mean squared prediction error of such a forecast is
equal to

$$E[(y_{n+h} - \hat{y}_{n+h})^2] = (\mu - \nu)^2 + \sigma^2\left(\sum_{k=0}^{h-1} \psi_k^2 + \sum_{k=h}^{\infty} (\psi_k - \nu_{k-h})^2\right),$$

where $\psi_0 = 1$. This is minimized by taking $\nu = \mu$ and $\nu_{k-h} = \psi_k$ for all $k \geq h$. So
the optimal $h$-step-ahead forecast is given by

$$\hat{y}_{n+h} = \mu + \psi_h\varepsilon_n + \psi_{h+1}\varepsilon_{n-1} + \psi_{h+2}\varepsilon_{n-2} + \cdots. \tag{7.13}$$

The corresponding $h$-step-ahead prediction error $(y_{n+h} - \hat{y}_{n+h})$ gives a forecast
variance of

$$\mathrm{SPE}(h) = E[(y_{n+h} - \hat{y}_{n+h})^2] = \sigma^2 \sum_{k=0}^{h-1} \psi_k^2.$$

The one-step-ahead prediction error is equal to the innovation $\varepsilon_{n+1}$. Therefore the
innovation process is also called the process of prediction errors. The forecast
variance increases if the forecast horizon $h$ becomes larger. This is natural, as the
past observations contain more information for the immediate future than for
the future far ahead. For $h \to \infty$ the forecast variance converges to the variance of
the process $y_t$. That is, in the long run all information from the past eventually dies
out. This is because the correlations $\rho_k$ of a stationary process converge to zero for
$k \to \infty$, so that the past information is uncorrelated with the infinitely far ahead
future. Forecast intervals can be constructed if the process is assumed to be
normally distributed. For instance, a 95 per cent forecast interval for $y_{n+h}$ is
given by the interval

$$\hat{y}_{n+h} - 1.96\sqrt{\mathrm{SPE}(h)} \leq y_{n+h} \leq \hat{y}_{n+h} + 1.96\sqrt{\mathrm{SPE}(h)}.$$

The forecast intervals are wider for larger horizons $h$, as the variance $\mathrm{SPE}(h)$
increases for larger values of $h$.
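
The forecast variance and the corresponding intervals can be sketched numerically as follows. The code assumes NumPy and known ARMA parameters; the helper names are illustrative, and the coefficients $\psi_k$ are computed by the recursion implied by $\phi(z)\psi(z) = \theta(z)$.

```python
import numpy as np

def psi_weights(phi, theta, m):
    """psi_0, ..., psi_m from phi(z) psi(z) = theta(z), with psi_0 = 1."""
    psi = np.zeros(m + 1)
    psi[0] = 1.0
    for k in range(1, m + 1):
        psi[k] = theta[k - 1] if k <= len(theta) else 0.0
        for j in range(1, min(k, len(phi)) + 1):
            psi[k] += phi[j - 1] * psi[k - j]
    return psi

def forecast_variance(phi, theta, sigma2, horizon):
    """SPE(h) = sigma^2 * sum_{k=0}^{h-1} psi_k^2 for h = 1, ..., horizon."""
    psi = psi_weights(phi, theta, horizon - 1)
    return sigma2 * np.cumsum(psi ** 2)

def forecast_interval(point_forecasts, spe, z=1.96):
    """Interval forecast +/- z * sqrt(SPE(h)); z = 1.96 gives 95 per cent under normality."""
    half_width = z * np.sqrt(spe)
    return point_forecasts - half_width, point_forecasts + half_width

spe_ma2 = forecast_variance([], [0.9, 0.8], 1.0, 4)    # constant from h = 3 onwards
spe_ar1 = forecast_variance([0.9], [], 1.0, 200)       # approaches sigma^2 / (1 - 0.81)
lower, upper = forecast_interval(np.zeros(4), spe_ma2)
```

The MA(2) case shows the variance levelling off once the horizon exceeds the order of the process, whereas for the AR(1) case it increases gradually towards the variance of the process.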


Forecasting  an  MA  process 

The above forecasting method becomes particularly simple for an MA($q$) model,
as $\phi(z) = 1$ in this case, so that $\psi(z) = \theta(z)$ and $\psi_k = \theta_k$ for $k \leq q$ and $\psi_k = 0$ for
$k > q$. For example, for an MA(2) model $y_t = \alpha + \varepsilon_t + \theta_1\varepsilon_{t-1} + \theta_2\varepsilon_{t-2}$ the one-, two-,
and three-step-ahead forecasts are

$$\hat{y}_{n+1} = \alpha + \theta_1\varepsilon_n + \theta_2\varepsilon_{n-1}, \quad \hat{y}_{n+2} = \alpha + \theta_2\varepsilon_n, \quad \hat{y}_{n+3} = \alpha.$$

An MA(2) process contains no information (apart from the mean) on future values
for $h > 2$. This is sometimes expressed by saying that an MA($q$) process has a
memory of length $q$.

Forecasting in practice

In the foregoing analysis we assumed that the process was observed since the
infinite past and that the ARMA parameters are known. In practice we do not
know these parameters and the time series is observed only on a finite time
interval. In the next section we discuss methods to estimate the ARMA parameters
from the observed time series. The forecast function (7.13) can then be approximated
as follows. Estimates of the coefficients $\psi_k$ are determined from the estimated
ARMA parameters. Further, the infinite sum in (7.13) is replaced by a finite
sum $\sum_{k=0}^{n-1} \hat{\psi}_{k+h}\hat{\varepsilon}_{n-k}$, where the terms $\hat{\varepsilon}_t$ are the residuals (the estimated
innovations) of the estimated ARMA model.




Example  7.6:  Simulated  Time  Series  (continued) 

As an illustration, we consider forecasts of three simulated time series of
foregoing examples, namely for the AR(2) series of Example 7.3 (see
Exhibit 7.3 (c)) and for the MA(2) and ARMA(1,1) series of Example 7.4
(see Exhibit 7.4 (b) and (c)). Exhibit 7.6 shows the forecasts and 95 per cent
forecast intervals for estimated models for these three time series. The models
are estimated from the first $n = 450$ values with methods to be discussed in
Section 7.2.2, and the estimated models are then used to forecast the last fifty
values of the series. The outcomes are as expected. The forecasts are more
accurate for short forecast horizons $h$, especially for the MA model, and the
forecast intervals become wider for larger forecast horizons. The intervals
contain the majority of the actual values in the forecast period.


Exhibit 7.6 Simulated Time Series (Example 7.6)

Actual series, forecasts (1- to 50-step-ahead), and 95% forecast intervals for three time series,
AR(2) (a), MA(2) (b), and ARMA(1,1) (c).

Exercises:  T:  7.4. 

7.1.7  Summary 

In  this  section  we  have  discussed  some  concepts,  models,  and  results  that 
are  needed  in  later  sections  to  construct  econometric  models  for  observed 
time  series. 

• A time series is stationary if its mean, variance, and autocorrelations are
constant over time. Stationary time series can be described by autoregressive
models, by moving average models, and by mixed ARMA
models.

• Stationarity requires restrictions on the autoregressive parameters (all
roots of the autoregressive polynomial should lie outside the unit circle).
Invertibility, which means that the error terms in an ARMA model have
the interpretation of one-step-ahead forecast errors, requires restrictions
on the moving average parameters (all roots of the moving average
polynomial should lie outside the unit circle).

• The properties of stationary time series can be summarized in terms of
the autocorrelations of the process. A moving average process has the
property that the autocorrelations become zero after a certain lag. An
autoregressive process is characterized by the fact that the partial autocorrelations
become zero after a certain lag.

• For given parameters of the ARMA model, future values of the time
series can be forecasted from past values by exploiting the correlations
that are present between successive values in the process.


7.2 Model estimation and selection


Uses  Chapters  1-4;  Section  5.5;  parts  of  Sections  5.2,  5.3,  5.6;  Section  7.1. 


7.2.1  The  modelling  process 

Iterative  steps  in  modelling 

In  empirical  time  series  analysis,  a  model  is  often  obtained  in  an  iterative 
process  of  model  specification,  diagnostic  testing,  and  model  adjustments. 
This  was  discussed  in  Section  5.1  for  regression  models  (see  Exhibit  5.1 
(p.  276)).  Exhibit  7.7  summarizes  the  main  steps  in  time  series  modelling. 
Here  we  assume  that  the  purpose  of  the  model  is  to  produce  forecasts,  as  is 
often the case in time series analysis. Further we assume that the investigated
time series is stationary. In practice, stationarity is often achieved after appropriate
data transformations, as will be discussed in Section 7.3.


Exhibit  7.7  Steps  in  modelling 

Iterative method of ARMA time series modelling.


Iterative method to model stationary time series

•  Step  1:  Graphs  of  the  data.  Make  graphs  of  the  time  series  (time  plots, 
scatter  plots  against  lagged  values)  and  of  transformations  (like  logarithms 
and  first  differences).  This  gives  a  first  impression  of  the  properties  of  the 
series  —  for  instance,  the  presence  of  trends  and  cyclical  fluctuations.  In  the 
next  steps  it  is  assumed  that  the  modelled  time  series  is  stationary,  which 
sometimes  requires  appropriate  transformations  of  the  original  series. 

•  Step  2:  Choice  of  lag  structure.  Compute  the  sample  autocorrelations  and 
the  sample  partial  autocorrelations  to  get  an  impression  of  the  nature  of  the 
correlations  in  the  time  series.  This  gives  a  first  indication  how  to  choose  the 
orders  p  and  q  of  a  possibly  adequate  ARMA(p,  q)  model. 

•  Step  3:  Estimation  of  the  model  parameters.  For  the  selected  orders  p  and  q, 
estimate  the  parameters  of  the  ARMA(p,  q)  model. 

•  Step  4:  Diagnostic  checking.  Evaluate  this  model  by  diagnostic  tests.  In 
particular,  investigate  whether  the  model  captures  the  main  correlations  in 
the  time  series  and  whether  the  model  is  able  to  produce  reliable  forecasts. 

•  Step  5:  Improve  the  model.  If  the  results  of  step  4  indicate  that  the  model  is 
not  satisfactory,  then  repeat  steps  1-4  (graphs  of  the  data,  choice  of  lag 
structure,  estimation,  and  diagnostic  checking).  The  model  can  be  adjusted 
along  the  lines  suggested  by  the  outcomes  of  the  diagnostic  tests.  This  may 
lead  to  models  other  than  ARMA  —  for  instance,  models  with  trends  (see 
Section  7.3)  or  non-linear  models  (see  Section  7.4). 

•  Step  6:  Use  the  model.  Finally,  when  the  final  model  performs  well  enough, 
it  can  be  used,  for  instance,  to  produce  out-of-sample  forecasts. 


Steps  1  and  2  in  this  process  constitute  the  so-called  model  identification 
phase.  This  is  an  important  phase,  as  the  main  problem  in  time  series 
modelling  is  often  to  find  a  good  specification  of  the  model.  In  practice  the 
diagnostic  tests  in  step  4  help  to  construct  a  sequence  of  models  where  each 
new  model  improves  upon  the  former  ones. 
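
Steps 2 to 4 of this method can be sketched for pure AR models as follows. This is only an illustration under assumptions that are not taken from the text: the order is chosen as the largest lag whose sample partial autocorrelation exceeds the rule-of-thumb band $2/\sqrt{n}$, estimation is by OLS, and the diagnostic check only inspects whether the residual autocorrelations stay within the same band. The function names, the simulated data, and the seed are hypothetical.

```python
import numpy as np

def fit_ar_ols(y, p):
    """OLS regression of y_t on a constant and p lags; returns (coefficients, residuals)."""
    n = len(y)
    X = np.column_stack([np.ones(n - p)] + [y[p - j:n - j] for j in range(1, p + 1)])
    beta, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return beta, y[p:] - X @ beta

def sacf(x, max_lag):
    """Sample autocorrelations at lags 1, ..., max_lag."""
    d = x - x.mean()
    return np.array([np.sum(d[k:] * d[:len(d) - k])
                     for k in range(1, max_lag + 1)]) / np.sum(d ** 2)

def select_and_fit(y, max_lag=10):
    y = np.asarray(y, dtype=float)
    band = 2.0 / np.sqrt(len(y))                      # rule-of-thumb band (assumption)
    # Step 2: sample partial autocorrelations from autoregressions of increasing order.
    spacf = np.array([fit_ar_ols(y, k)[0][-1] for k in range(1, max_lag + 1)])
    significant = np.nonzero(np.abs(spacf) > band)[0]
    p = int(significant[-1]) + 1 if significant.size else 0
    # Step 3: estimate the AR(p) model.
    beta, resid = fit_ar_ols(y, p)
    # Step 4: crude diagnostic check on the residual autocorrelations.
    adequate = bool(np.all(np.abs(sacf(resid, max_lag)) <= band))
    return p, beta, adequate

# Illustrative data: an AR(2) series with coefficients 1.5 and -0.6.
rng = np.random.default_rng(7)
eps = rng.standard_normal(2200)
y = np.zeros(2200)
for t in range(2, 2200):
    y[t] = 1.5 * y[t - 1] - 0.6 * y[t - 2] + eps[t]
p, beta, adequate = select_and_fit(y[200:])           # drop the first 200 values as burn-in
```

If `adequate` is false, step 5 applies: the specification is adjusted and the steps are repeated. The selection rule used here can pick an order that is too large when a higher-order sample partial autocorrelation happens to fall outside the band.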


Overview  of  Section  7.2 

Parts  of  steps  1  and  2  of  the  above  iterative  modelling  method  were  discussed 
in  Section  7.1  (see  in  particular  Section  7.1.5  on  the  identification  of  the  lags 
of  AR  and  MA  models).  Section  7.2.3  describes  some  additional  methods  for 
model  identification.  In  Section  7.2.2  we  consider  step  3,  the  estimation  of  a 
given  ARMA  model.  Steps  4  and  5,  diagnostic  tests  and  their  use  in  model 
improvement,  are  discussed  in  Section  7.2.4.  Finally,  the  forecasting  of  time 
series  in  step  6  has  already  been  treated  in  Section  7.1.6.  In  Section  7.2.4  we 
present  diagnostic  tools  to  evaluate  the  forecast  performance  of  the  model  on 
a  set  of  evaluation  data,  prior  to  the  actual  application  of  the  model  in 
forecasting  the  future. 
Example 7.7: Industrial Production (continued)

We return to the series of industrial production in the USA described in
Example 7.1. We will discuss (i) step 1 and (ii) step 2 of the above modelling
method.

(i) Step 1: graphs of the industrial production data

This first step has already been made in Example 7.1 (see Exhibit 7.1).
The original series shows exponential growth. After we have taken logarithms,
the resulting series $y_t$ shows a more or less linear growth path with
some seasonal fluctuations. A time series that looks more like a stationary
series is obtained by considering the quarterly series of yearly growth
rates, denoted by $\Delta_4 y_t = y_t - y_{t-4}$. A time plot of this series was given in
Exhibit 7.1 (c), and the series shows more or less regular fluctuations around
a stable mean.

(ii)  Step  2:  identification  of  lag  structure 

To get some idea of the involved correlations in the time series, Exhibit 7.8
shows the first twelve sample autocorrelations and sample partial autocorrelations
of this series. The SACF dies out more slowly than the SPACF,
and the SPACF values are small from lag three onwards. As a first guess,
this suggests specifying an AR(2) model for the series $\Delta_4 y_t$ of growth rates.
We will investigate in the rest of this section whether this model gives
an acceptable description of this time series (see Examples 7.8, 7.10,
and 7.11).


Lag     SACF     SPACF
  1    0.851     0.851
  2    0.594    -0.466
  3    0.319    -0.119
  4    0.072    -0.083
  5   -0.082     0.117
  6   -0.176    -0.134
  7   -0.236    -0.104
  8   -0.264    -0.030
  9   -0.234     0.159
 10   -0.182    -0.098
 11   -0.152    -0.186
 12   -0.121     0.062

Exhibit 7.8 Industrial Production (Example 7.7)

Sample autocorrelations (SACF) and partial autocorrelations (SPACF) of quarterly series of
yearly growth rates of US industrial production ($\Delta_4 y_t$).
Exercises: E: 7.17a, 7.18a.

7.2.2 Parameter estimation
OLS  estimator  of  AR(1)  model 

In  this  section  we  discuss  the  estimation  of  ARMA(p,  q)  models  with  given 
orders  for  p  and  q.  For  simplicity  of  the  exposition  we  restrict  the  attention  to 
processes  with  zero  mean.  The  results  can  easily  be  extended  to  time  series 
with  non-zero  mean  by  including  a  constant  term  in  the  model.  First  we 
consider the stationary AR(1) model

$y_t = \phi y_{t-1} + \varepsilon_t, \quad t = 2, \ldots, n.$  (7.14)

Here $-1 < \phi < 1$, and we assume for simplicity that the innovations
$\varepsilon_t$ are normally distributed. This model has the form of a regression
model, but the regressor $y_{t-1}$ is stochastic. The OLS estimator of $\phi$ is
given by

$\hat{\phi} = \frac{\sum_{t=2}^{n} y_{t-1} y_t}{\sum_{t=2}^{n} y_{t-1}^2}.$


Derivation  of  asymptotic  distribution  of  OLS  estimator 

As the finite sample distribution of $\hat{\phi}$ is rather involved, one usually
takes the asymptotic distribution as an approximation in sufficiently large
samples. By substituting the expression (7.14) for $y_t$, it follows that

$\sqrt{n}(\hat{\phi} - \phi) = \frac{\frac{1}{\sqrt{n}} \sum_{t=2}^{n} y_{t-1}\varepsilon_t}{\frac{1}{n} \sum_{t=2}^{n} y_{t-1}^2}.$  (7.15)


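Some details of the next steps are left as an exercise (see Exercise 7.12). The term
$\frac{1}{n}\sum_{t=2}^{n} y_{t-1}^2$ in the denominator of (7.15) is the sample mean of the
correlated terms $y_{t-1}^2$. As the correlations $\rho_k = E[y_t y_{t-k}]/\gamma_0 = \phi^k$
converge to zero exponentially fast, the probability limit of this term exists and
$\mathrm{plim}\left(\frac{1}{n}\sum_{t=2}^{n} y_{t-1}^2\right) = E[y_{t-1}^2] = \gamma_0$.
Next we consider some properties of the sample average
$\frac{1}{n}\sum_{t=2}^{n} y_{t-1}\varepsilon_t$ related to the numerator of (7.15). This is
the sample mean of uncorrelated terms, because for $t > s$ the error term $\varepsilon_t$ is
independent of (and hence uncorrelated with) $y_{t-1}$, $y_{s-1}$, and $\varepsilon_s$, so that
$E[y_{t-1}\varepsilon_t y_{s-1}\varepsilon_s] = E[y_{t-1} y_{s-1}\varepsilon_s]E[\varepsilon_t] = 0$.
Each term has expected value $E[y_{t-1}\varepsilon_t] = E[y_{t-1}]E[\varepsilon_t] = 0$ and
variance $E[y_{t-1}^2\varepsilon_t^2] = E[y_{t-1}^2]E[\varepsilon_t^2] = \gamma_0\sigma^2$. Then
the central limit theorem (see Section 1.3.3 (p. 50)) implies that the numerator of (7.15)
has the property that

$\sqrt{n} \cdot \frac{1}{n}\sum_{t=2}^{n} y_{t-1}\varepsilon_t = \frac{1}{\sqrt{n}}\sum_{t=2}^{n} y_{t-1}\varepsilon_t \xrightarrow{d} N(0, \gamma_0\sigma^2).$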



Combining the above results on the numerator and denominator of (7.15),
and using the fact that $\gamma_0 = \sigma^2/(1 - \phi^2)$, we conclude that
$\sqrt{n}(\hat{\phi} - \phi)$ converges in distribution to a normal distribution with
mean zero and variance $\gamma_0\sigma^2/\gamma_0^2 = \sigma^2/\gamma_0 = 1 - \phi^2$. That is,

$\sqrt{n}(\hat{\phi} - \phi) \xrightarrow{d} N(0, 1 - \phi^2).$  (7.16)

So  the  least  squares  estimator  is  consistent  and  has  an  asymptotic  normal  distri¬ 
bution. 


Approximate  distribution  of  OLS  estimator 

It follows from the foregoing results that the approximate finite sample
distribution of the OLS estimator $\hat{\phi}$ in the AR(1) model is given by

$\hat{\phi} \approx N\left(\phi, \frac{1 - \phi^2}{n}\right).$

The hypothesis that the time series $y_t$ is uncorrelated (that is, that $\phi = 0$)
can be tested by the t-test. The correlations are not significant (at an
approximate 5 per cent significance level) if $\hat{\phi}$ is less than $2/\sqrt{n}$ in
absolute value. Note that for values of $\phi \approx 1$ the asymptotic variance
$1 - \phi^2$ approaches zero. This suggests that for $\phi = 1$ the OLS estimator
converges at a higher speed than $\sqrt{n}$. This is indeed the case, as will be
further discussed in Section 7.3.3.
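
The OLS estimator and its approximate standard error can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the helper names `simulate_ar1` and `ols_ar1`, the value $\phi = 0.6$, the sample size, and the seed are all hypothetical choices, and the innovations are drawn as standard normal.

```python
import math
import random


def simulate_ar1(phi, n, seed=0):
    """Simulate n observations of a zero-mean AR(1) with N(0, 1) innovations."""
    rng = random.Random(seed)
    # Draw the starting value from the stationary distribution N(0, 1/(1 - phi^2)).
    y_prev = rng.gauss(0.0, 1.0) / math.sqrt(1.0 - phi ** 2)
    y = []
    for _ in range(n):
        y_prev = phi * y_prev + rng.gauss(0.0, 1.0)
        y.append(y_prev)
    return y


def ols_ar1(y):
    """Return the OLS estimate of phi and the approximate s.e. sqrt((1 - phi^2)/n)."""
    n = len(y)
    num = sum(y[t - 1] * y[t] for t in range(1, n))
    den = sum(y[t - 1] ** 2 for t in range(1, n))
    phi_hat = num / den
    se = math.sqrt((1.0 - phi_hat ** 2) / n)
    return phi_hat, se


y = simulate_ar1(phi=0.6, n=2000)
phi_hat, se = ols_ar1(y)
# Approximate 5 per cent test of phi = 0: compare |phi_hat| with 2/sqrt(n).
is_significant = abs(phi_hat) > 2.0 / math.sqrt(len(y))
print(f"phi_hat = {phi_hat:.3f}, s.e. = {se:.3f}, significant: {is_significant}")
```

The standard error plugs the estimate into the asymptotic variance $1 - \phi^2$, whereas the significance check uses the variance $1/n$ that applies under the null hypothesis $\phi = 0$.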


Estimation of AR(p) models

The parameters of a stationary AR(p) model

$y_t = \alpha + \phi_1 y_{t-1} + \cdots + \phi_p y_{t-p} + \varepsilon_t, \quad t = p+1, \ldots, n,$

can also be estimated by OLS. The stationarity condition means that the
model can be written as an infinite moving average as in (7.11). This implies
that the error term $\varepsilon_t$ is uncorrelated with the $p \times 1$ vector of
regressors $x_t = (y_{t-1}, \ldots, y_{t-p})'$. Therefore the orthogonality condition is
satisfied, that is, the regressors $y_{t-k}$ in the AR(p) model are exogenous for all
$k = 1, \ldots, p$. Further, as the process $y_t$ is stationary, the matrix of second
order moments $Q_n = \frac{1}{n}\sum_{t=p+1}^{n} x_t x_t'$ converges in probability to the
corresponding matrix $Q$ of population moments that is non-singular (see Exercise
7.3). It follows from the results in Section 4.1.4 (p. 197) that the OLS
estimators are consistent and that the covariance matrix can be approximated
by $\sigma^2\left(\sum_{t=p+1}^{n} x_t x_t'\right)^{-1}$. Under the assumption of normality, the
OLS estimators coincide with the ML estimators (or, better, the conditional ML
estimators, treating the initial values $y_1, \ldots, y_p$ as fixed).
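
As a rough sketch of the AR(p) regression described above, the following code builds the lagged regressors, estimates the coefficients by OLS, and approximates the covariance matrix by $s^2 (X'X)^{-1}$. It assumes NumPy is available; the function name `ols_ar`, the simulated AR(2) parameters, and the seed are illustrative assumptions and are not taken from the text.

```python
import numpy as np


def ols_ar(y, p):
    """OLS for y_t = alpha + phi_1*y_{t-1} + ... + phi_p*y_{t-p} + eps_t.

    Returns the coefficients (alpha, phi_1, ..., phi_p) and their standard errors.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Row for time t (0-based t = p, ..., n-1): [1, y_{t-1}, ..., y_{t-p}].
    lags = [y[p - k:n - k] for k in range(1, p + 1)]
    X = np.column_stack([np.ones(n - p)] + lags)
    target = y[p:]
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ beta
    s2 = resid @ resid / (len(target) - X.shape[1])
    cov = s2 * np.linalg.inv(X.T @ X)
    return beta, np.sqrt(np.diag(cov))


# Simulated stationary AR(2) series (hypothetical parameter values).
rng = np.random.default_rng(0)
n, burn = 3000, 200
eps = rng.standard_normal(n + burn)
y = np.zeros(n + burn)
for t in range(2, n + burn):
    y[t] = 0.5 + 1.3 * y[t - 1] - 0.5 * y[t - 2] + eps[t]

beta, se = ols_ar(y[burn:], p=2)
print("coefficients:", np.round(beta, 3))
print("std. errors: ", np.round(se, 3))
```

The first `burn` simulated values are discarded so that the arbitrary starting values have little influence on the sample that is used.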

Estimation  of  MA(1)  model 

The  estimation  of  models  with  moving  average  terms  is  somewhat  more 
involved.  To  illustrate  this  we  first  consider  the  MA(1)  model 


$y_t = \varepsilon_t + \theta\varepsilon_{t-1}$

with $-1 < \theta < 1$. The parameter $\theta$ cannot be estimated by regressing $y_t$ on
$\varepsilon_{t-1}$, because the regressor $\varepsilon_{t-1}$ is not directly observed. Since
$-1 < \theta < 1$, the model is invertible, so that we can express $\varepsilon_{t-1}$ in
terms of the past observations $y_{t-k}$, $k \geq 1$, by means of (7.9). Substituting
this expression for $\varepsilon_{t-1}$ (that is, applied at time instant $t - 1$) in the
MA(1) model, we obtain

$y_t = (\theta y_{t-1} - \theta^2 y_{t-2} + \theta^3 y_{t-3} - \cdots) + \varepsilon_t.$

This is a non-linear regression model. The parameter $\theta$ can be estimated by
NLS after truncating $\sum_{k=0}^{\infty} (-\theta)^k y_{t-1-k}$ to the finite sum
$\sum_{k=0}^{t-2} (-\theta)^k y_{t-1-k}$, which involves only the observed values
$y_1, \ldots, y_{t-1}$.
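
The truncated NLS problem for the MA(1) model can be sketched as follows. Truncating the infinite sum is equivalent to computing the residuals recursively as $\varepsilon_t(\theta) = y_t - \theta\varepsilon_{t-1}(\theta)$ with $\varepsilon_0 = 0$. A simple grid search over $\theta$ stands in here for a proper non-linear optimizer; the function names, the grid, and the simulated data are illustrative assumptions.

```python
import random


def ma1_ssr(y, theta):
    """Sum of squared residuals eps_t = y_t - theta*eps_{t-1}, starting from eps_0 = 0."""
    eps_prev, ssr = 0.0, 0.0
    for y_t in y:
        eps_t = y_t - theta * eps_prev
        ssr += eps_t ** 2
        eps_prev = eps_t
    return ssr


def nls_ma1(y, grid_size=199):
    """Minimize the sum of squared residuals over a grid of theta values in (-1, 1)."""
    grid = [-0.99 + 1.98 * i / (grid_size - 1) for i in range(grid_size)]
    return min(grid, key=lambda theta: ma1_ssr(y, theta))


# Simulated MA(1) series with theta = 0.5 (hypothetical value).
rng = random.Random(1)
eps = [rng.gauss(0.0, 1.0) for _ in range(2001)]
y = [eps[t] + 0.5 * eps[t - 1] for t in range(1, 2001)]

theta_hat = nls_ma1(y)
print(f"theta_hat = {theta_hat:.2f}")
```

The grid step of 0.01 limits the precision of the estimate; in practice a numerical optimizer would refine the minimum.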


Estimation  of  ARMA  models  by  NLS  and  ML 

Stationary and invertible ARMA models can be estimated in a similar way. The
ARMA(p, q) model $\phi(L)y_t = \theta(L)\varepsilon_t$ can be written in the form of the
(infinite) regression model (7.2) so that

$y_t = \sum_{k=1}^{\infty} \pi_k y_{t-k} + \varepsilon_t.$

The parameters $\pi_k$ are (non-linear) functions of the $(p + q)$ ARMA
model parameters. If we write $\pi(z) = 1 - \sum_{k=1}^{\infty} \pi_k z^k$, then
$\pi(L)y_t = \varepsilon_t$ so that $\phi(L)y_t = \theta(L)\varepsilon_t = \theta(L)\pi(L)y_t$. So the
relation between the regression parameters $\pi_k$ and the $(p + q)$ ARMA model
parameters $\phi_k$ and $\theta_k$ is given by

$\phi(z) = \theta(z)\pi(z).$

This equation (in terms of polynomials) can be used to compute the values of $\pi_k$ for
given values of the ARMA parameters $\phi_k$ and $\theta_k$. If the infinite regression is
truncated to a finite regression $y_t \approx \sum_{k=1}^{m} \pi_k y_{t-k} + \varepsilon_t$ (with
$m > p + q$), then the ARMA parameters of $\phi(z)$ and $\theta(z)$ can be estimated by NLS.
If the innovations $\varepsilon_t$ are assumed to be normally distributed, then asymptotically
NLS is equivalent to ML. Indeed, suppose that $\varepsilon_t \sim NID(0, \sigma^2)$ and that the
ARMA model is rewritten as $\pi(L)y_t = \varepsilon_t$ as before. Then the conditional
log-likelihood of the ARMA model (treating the initial values as fixed) is given by

$\log(L(\phi_1, \ldots, \phi_p, \theta_1, \ldots, \theta_q, \sigma^2)) = -\frac{n-p}{2}\log(2\pi) - \frac{n-p}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{t=p+1}^{n} \varepsilon_t^2.$

Conditional ML corresponds to maximization of this function with respect to the
parameters. So the ARMA parameters are obtained by minimizing $\sum \varepsilon_t^2$, which
shows that ML in this case is equivalent to NLS for $n \to \infty$ (as the effect of
treating the initial values as fixed then vanishes, provided that the roots of the
MA polynomial are not too close to the unit circle, as otherwise the effect of initial
values vanishes only very slowly). Standard errors of the estimates can be obtained
as usual from the information matrix, and tests (such as t-, F-, and LR-tests) can
be performed in the usual way.
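
The polynomial relation $\phi(z) = \theta(z)\pi(z)$ can be solved for the weights $\pi_k$ by matching coefficients of equal powers of $z$. The sketch below assumes the sign conventions $\phi(z) = 1 - \phi_1 z - \cdots - \phi_p z^p$ and $\theta(z) = 1 + \theta_1 z + \cdots + \theta_q z^q$, which agree with the ARMA(1,1) and MA(1) models written out in this section; the function name `pi_weights` is a hypothetical helper.

```python
def pi_weights(phi, theta, m):
    """Return [pi_1, ..., pi_m] such that phi(z) = theta(z) * pi(z).

    phi   : list [phi_1, ..., phi_p],     phi(z)   = 1 - sum_k phi_k   z^k
    theta : list [theta_1, ..., theta_q], theta(z) = 1 + sum_k theta_k z^k
    pi(z) = 1 - sum_k pi_k z^k, so the coefficient of z^k in pi(z) is -pi_k.
    """
    p, q = len(phi), len(theta)
    c = [1.0]  # c[k] is the coefficient of z^k in pi(z)
    for k in range(1, m + 1):
        phi_coef = -phi[k - 1] if k <= p else 0.0
        # Coefficient of z^k in theta(z)*pi(z), excluding the term c[k] itself.
        acc = sum(theta[j - 1] * c[k - j] for j in range(1, min(k, q) + 1))
        c.append(phi_coef - acc)
    return [-c_k for c_k in c[1:]]


# MA(1) with theta = 0.5: weights theta, -theta^2, theta^3, ...
pi_ma1 = pi_weights(phi=[], theta=[0.5], m=3)
# ARMA(1,1) with phi = 0.7 and theta = 0.4: pi_k = (phi + theta) * (-theta)^(k-1)
pi_arma11 = pi_weights(phi=[0.7], theta=[0.4], m=3)
print(pi_ma1, pi_arma11)
```

For the MA(1) case this reproduces the regression $y_t = \theta y_{t-1} - \theta^2 y_{t-2} + \theta^3 y_{t-3} - \cdots + \varepsilon_t$ given earlier in this section.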


Iterative  estimation  methods 

Instead  of  direct  optimization  of  the  (non-linear)  log-likelihood,  one  can  also 
apply  simpler  iterative  methods.  As  an  illustration  we  consider  the  stationary 
and  invertible  ARMA(1,1)  model 


$y_t = \phi y_{t-1} + \varepsilon_t + \theta\varepsilon_{t-1}.$

The main idea is to estimate the parameters by two iterative regression steps: a first
step, where $\phi$ is estimated for given value of $\theta$, and a second step, where $\theta$ is
estimated for given value of $\phi$. The model can be written as
$y_t = \phi y_{t-1} + \theta(L)\varepsilon_t$, where $\theta(L) = 1 + \theta L$. As the model is
invertible, so that $-1 < \theta < 1$, it follows that $\theta(z)\alpha(z) = 1$, where
$\alpha(z) = 1/(1 + \theta z) = \sum_{k=0}^{\infty} (-\theta)^k z^k$. In the first step it is
assumed that the MA parameter $\theta$ is known. Define the process $x_t$ by
$x_t = \alpha(L)y_t$, so that $x_t = -\theta x_{t-1} + y_t$. As starting condition we take
$x_0 = 0$, so that $x_t$, $t = 1, \ldots, n$, can be computed from the observed time series
$y_t$, $t = 1, \ldots, n$, for given value of $\theta$. Then $x_t$ follows an AR(1) process, since
$x_t - \phi x_{t-1} = (1 - \phi L)x_t = (1 - \phi L)\alpha(L)y_t = \alpha(L)\theta(L)\varepsilon_t = \varepsilon_t$,
that is,

$x_t = \phi x_{t-1} + \varepsilon_t.$

In this AR(1) model, $\phi$ can be estimated by OLS. Let the corresponding OLS
residuals be denoted by $e_t = x_t - \hat{\phi}x_{t-1}$. In the second step, for given value
of $\hat{\phi}$, the MA parameter $\theta$ can be estimated by regressing $y_t - \hat{\phi}y_{t-1}$ on
$e_{t-1}$ in the regression model

$(y_t - \hat{\phi}y_{t-1}) = \theta e_{t-1} + \varepsilon_t.$

The estimated value of $\theta$ can be used to perform step 1 again to obtain a new
estimate of $\phi$, which can be used to perform step 2 again, and so on until the
estimates converge. To start the iterations, we can take $\theta = 0$ in step 1, so that
$x_t = y_t$ in this first round. The advantage of the above method is that the estimates
are obtained by iterative (linear) regressions, whereas ML and NLS need
non-linear optimization methods. Similar methods can be followed for ARMA models
with higher orders p and q.
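
The two regression steps for the ARMA(1,1) model can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `iterate_arma11`, the convergence tolerance, the starting value $x_0 = 0$, and the simulated parameters ($\phi = 0.7$, $\theta = 0.4$) are all choices made here.

```python
import random


def iterate_arma11(y, max_iter=200, tol=1e-8):
    """Alternate OLS regressions for y_t = phi*y_{t-1} + eps_t + theta*eps_{t-1}."""
    n = len(y)
    phi, theta = 0.0, 0.0  # theta = 0 in the first round, so that x_t = y_t
    for _ in range(max_iter):
        # Step 1: filter x_t = -theta*x_{t-1} + y_t (with x_0 = 0),
        # then regress x_t on x_{t-1} to estimate phi.
        x, x_prev = [], 0.0
        for y_t in y:
            x_prev = y_t - theta * x_prev
            x.append(x_prev)
        phi_new = (sum(x[t] * x[t - 1] for t in range(1, n))
                   / sum(x[t - 1] ** 2 for t in range(1, n)))
        # OLS residuals e_t = x_t - phi*x_{t-1}; e[0] is a placeholder and is not used.
        e = [0.0] + [x[t] - phi_new * x[t - 1] for t in range(1, n)]
        # Step 2: regress (y_t - phi*y_{t-1}) on e_{t-1} to estimate theta.
        num = sum((y[t] - phi_new * y[t - 1]) * e[t - 1] for t in range(2, n))
        den = sum(e[t - 1] ** 2 for t in range(2, n))
        theta_new = num / den
        converged = abs(phi_new - phi) < tol and abs(theta_new - theta) < tol
        phi, theta = phi_new, theta_new
        if converged:
            break
    return phi, theta


# Simulated ARMA(1,1) series; the first 200 values are dropped as burn-in.
rng = random.Random(2)
eps = [rng.gauss(0.0, 1.0) for _ in range(4200)]
y_full = [0.0]
for t in range(1, 4200):
    y_full.append(0.7 * y_full[-1] + eps[t] + 0.4 * eps[t - 1])
y = y_full[200:]

phi_hat, theta_hat = iterate_arma11(y)
print(f"phi_hat = {phi_hat:.3f}, theta_hat = {theta_hat:.3f}")
```

Each round uses only single-regressor OLS formulas, which is the practical appeal of the method; no claim is made here about its efficiency relative to ML.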


Example  7.8:  Industrial  Production  (continued) 

We continue our analysis of the quarterly series of yearly growth rates
$\Delta_4 y_t$ of US industrial production. In Example 7.7 we discussed steps 1 and 2
of the modelling process of Section 7.2.1. Now we consider step 3, that is, the
estimation of the model parameters. We will discuss (i) the estimates of
the AR(2) model and (ii) an interpretation of the estimated model.

(i)  Estimates  of  AR(2)  model 

For reasons discussed in Section 7.2.1, an AR(2) model is postulated for the
series $\Delta_4 y_t$. Therefore we estimate an AR(2), and we include an intercept
because the average growth rate is non-zero. The parameters of the AR(2)
model are estimated by regressing $\Delta_4 y_t$ on a constant and the two lagged
values $\Delta_4 y_{t-1}$ and $\Delta_4 y_{t-2}$. The result is in Exhibit 7.9 and can be
summarized as follows (asymptotic standard errors of the parameters are in
parentheses).

$\Delta_4 y_t = 0.007 + 1.332\,\Delta_4 y_{t-1} - 0.546\,\Delta_4 y_{t-2} + e_t.$
         (0.002)  (0.072)            (0.072)

The considered data period is 1961-94, with $n = 4 \cdot 34 = 136$ observations.
As was mentioned in Example 7.1, the values of $y_t$ are also known prior to
1961. This allows us to estimate the above regression equation using $n = 136$
observations, since in the initial period the values of the required lagged
regressors (which involve values of $y_t$ back to 1959.3 to compute the first
value of $\Delta_4 y_{t-2}$) are known.

(ii)  Interpretation  of  the  estimated  model 

The estimated AR(2) polynomial $\hat{\phi}(z) = 1 - 1.332z + 0.546z^2$ can be
factorized as in (7.5). This gives $\hat{\phi}(z) = (1 - \alpha_1 z)(1 - \alpha_2 z)$ with
$\alpha_{1,2} = 0.67 \pm 0.32i$, so that $|\alpha_{1,2}| \approx 0.74$ lies well below 1. This
provides some support for the stationarity of the series $\Delta_4 y_t$. The
stationarity of this series will be investigated by means of statistical tests in
Example 7.16. The AR(2) model can be used to determine the average annual
growth rate, that is, to estimate $E[\Delta_4 y_t]$. It follows from (7.17) that this
growth rate (in percentages) is estimated as
Dependent Variable: D4Y
Method: Least Squares
Sample: 1961:1 1994:4
Included observations: 136

Variable     Coefficient   Std. Error   t-Statistic   Prob.
C               0.007147     0.002161      3.307447   0.0012
D4Y(-1)         1.332025     0.072094     18.47633    0.0000
D4Y(-2)        -0.545933     0.072174     -7.564120   0.0000

R-squared             0.821380    Mean dependent var       0.032213
Adjusted R-squared    0.818694    S.D. dependent var       0.049219
S.E. of regression    0.020958    Akaike info criterion   -4.870824
Sum squared resid     0.058416    Schwarz criterion       -4.806574
Log likelihood      334.2160      F-statistic            305.7993
Durbin-Watson stat    2.050254    Prob(F-statistic)        0.000000

Exhibit 7.9 Industrial Production (Example 7.8)

AR(2) model for the quarterly series of yearly growth rates of US industrial production.


$100 \cdot \frac{0.007}{1 - 1.332 + 0.546} = 3.34\%$

(the value 3.34 is obtained from the unrounded estimates in Exhibit 7.9).


The  quality  of  the  AR(2)  model  for  this  series  will  be  further  evaluated  in  the 
next  two  sections,  where  we  will  also  consider  alternative  ARMA  models 
(see  Examples  7.10  and  7.11). 
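
The factorization of the estimated AR(2) polynomial and the implied mean growth rate can be checked numerically. The sketch below uses the unrounded estimates reported in Exhibit 7.9; the variable names are illustrative, and the mean is computed as $\alpha/(1 - \phi_1 - \phi_2)$, the usual expression for the mean of a stationary AR(2) process with constant $\alpha$.

```python
import cmath

# Estimates reported in Exhibit 7.9.
alpha, phi1, phi2 = 0.007147, 1.332025, -0.545933

# phi(z) = 1 - phi1*z - phi2*z^2 = (1 - a1*z)(1 - a2*z), so that a1 and a2
# solve a^2 - phi1*a - phi2 = 0 (sum of roots phi1, product of roots -phi2).
disc = cmath.sqrt(phi1 ** 2 + 4.0 * phi2)
a1 = (phi1 + disc) / 2.0
a2 = (phi1 - disc) / 2.0

# Mean of the stationary AR(2) process, expressed as a percentage.
mean_growth_pct = 100.0 * alpha / (1.0 - phi1 - phi2)

print(f"a1 = {a1:.3f}, a2 = {a2:.3f}, modulus = {abs(a1):.3f}")
print(f"estimated mean growth rate = {mean_growth_pct:.2f}%")
```

Both inverse roots have modulus below 1, in line with the stationarity argument in the example.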


Exercises:  T:  7.2a,  7.3d;  S:  7.12a-e. 


7.2.3  Model  selection 
Identification  of  ARMA  model  orders 

The  estimation  of  ARMA  models  requires  first  that  the  orders  p  of  the 
autoregressive  part  and  q  of  the  moving  average  part  are  chosen.  The  choice 
of  these  orders  is  called  the  identification  of  the  ARMA  model  (see  step  2  of 
the  modelling  method  described  in  Section  7.2.1).  We  will  now  discuss  some 
tools  for  selecting  the  model  orders  p  and  q;  related  diagnostic  tests  are 
described  in  the  next  section.  The  results  in  Section  7.1.5  show  that  the 
sample  (partial)  autocorrelations  are  helpful  for  selecting  the  orders  of  MA 
and AR models. The theoretical ACF becomes zero beyond lag q for an MA(q)
process and the theoretical PACF becomes zero beyond lag p for an AR(p)
process. These correlations can be estimated from the sample by the SACF
(denoted by $r_k$) and SPACF (denoted by $\hat{\phi}_{kk}$) (see Section 7.1.5). To select
the orders we can plot the correlations $r_k$ and $\hat{\phi}_{kk}$ against the time lag $k$.
The plot of $r_k$ is called the correlogram.
Significance of the first order sample autocorrelation coefficient

One way to select the orders of an AR or MA model is to test whether the
(partial) correlations differ significantly from zero. For this purpose we need
to know the (asymptotic) distribution of the sample (partial)
autocorrelations. We derive this distribution for the first order sample
autocorrelation $r_1$ of a white noise process. The value of $r_1$ is obtained by
regression in the AR(1) model $y_t = \alpha + \phi y_{t-1} + \varepsilon_t$, and the
asymptotic distribution is given in (7.16). A white noise process has $\phi = 0$,
so that in large enough samples there holds

$r_1 \approx N\left(0, \frac{1}{n}\right)$ if $y_t$ is white noise.

The null hypothesis of no autocorrelation ($\rho_1 = 0$) can be tested against the
alternative $\rho_1 \neq 0$. At (approximate) 5 per cent level, the null hypothesis is
rejected if

$|r_1| > \frac{2}{\sqrt{n}}.$

In this case the first order autocorrelation is significant.

Significance  of  S(P)ACF  for  AR  and  MA  processes 

The first order sample partial autocorrelation $\hat{\phi}_{11}$ is obtained by the
regression (7.12) with $k = 1$, that is, by regression in the model
$y_t = \alpha + \phi_{11} y_{t-1} + \varepsilon_t$. This means that $\hat{\phi}_{11} = r_1$, so that
$\hat{\phi}_{11} \approx N(0, 1/n)$ if the process is white noise. Similar results hold true
for higher order sample (partial) autocorrelations. It can be shown that, for an
MA(q) process, the SACF $r_k$ for $k > q$ are approximately normally distributed
with mean zero and variance

MA(q): $\mathrm{var}(r_k) \approx \frac{1}{n}\left(1 + 2\sum_{j=1}^{q} r_j^2\right)$ for all $k > q$.

The significance of $r_k$ can be tested by the t-test. If the SACF are not
significant beyond lag q, this indicates that an MA(q) model may be
appropriate for the time series. For an AR(p) process, the SPACF for $k > p$ are,
for large enough samples, approximately normally distributed with mean zero
and variance

AR(p): $\mathrm{var}(\hat{\phi}_{kk}) \approx \frac{1}{n}$ for all $k > p$.

The significance can again be tested by the t-test. If the SPACF are not
significant beyond lag p, this means that an AR(p) model can be appropriate
to describe the time series.
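
The sample correlations and their approximate standard errors can be computed along the following lines. This is a sketch under stated assumptions: the SACF is computed from deviations from the sample mean, and the SPACF is obtained from the SACF by the Durbin-Levinson recursion, which approximates (but does not exactly reproduce) the regression-based values of (7.12). All function names are hypothetical.

```python
import math


def sacf(y, max_lag):
    """Sample autocorrelations r_1, ..., r_max_lag (deviations from the sample mean)."""
    n = len(y)
    mean = sum(y) / n
    d = [v - mean for v in y]
    denom = sum(v * v for v in d)
    return [sum(d[t] * d[t - k] for t in range(k, n)) / denom
            for k in range(1, max_lag + 1)]


def spacf_from_sacf(r):
    """Partial autocorrelations phi_kk, k = 1..len(r), by the Durbin-Levinson recursion."""
    pacf, prev = [], []  # prev holds phi_{k-1,1}, ..., phi_{k-1,k-1}
    for k in range(1, len(r) + 1):
        num = r[k - 1] - sum(prev[j] * r[k - 2 - j] for j in range(k - 1))
        den = 1.0 - sum(prev[j] * r[j] for j in range(k - 1))
        phi_kk = num / den
        prev = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)] + [phi_kk]
        pacf.append(phi_kk)
    return pacf


def bartlett_se(r, q, n):
    """Approximate s.e. of r_k for k > q if the process is MA(q)."""
    return math.sqrt((1.0 + 2.0 * sum(v * v for v in r[:q])) / n)


# Theoretical ACF of an AR(1) with phi = 0.5: the PACF is zero beyond lag 1.
print(spacf_from_sacf([0.5, 0.25, 0.125]))
# First two SACF values of Exhibit 7.8: the recursion gives a second partial
# autocorrelation close to the regression-based value -0.466 in the exhibit.
print(spacf_from_sacf([0.851, 0.594]))
```

The small difference from the value in Exhibit 7.8 arises because the recursion works from the sample autocorrelations, whereas the exhibit is based on regressions.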

Model  selection  by  means  of  tests  and  information  criteria 

The  foregoing  methods  can  be  used  to  select  the  order  of  an  AR  model  or  of 
an  MA  model.  In  general  it  is  not  easy  to  select  the  orders  of  a  mixed  ARMA 
model  from  the  S(P)ACF.  One  can  instead  estimate  a  collection  of  ARMA 
models,  for  different  orders  of  p  and  q,  and  then  select  a  model  from  this 
collection. Competing models can be compared by tests or by using
information criteria. If two competing models are nested, in the sense that one
model is a restriction of the other model (for example an AR(2) model versus
an ARMA(2,2) model), then one could select the best model by performing
an F-test or an LR-test on the parameter restrictions. If competing models are
not  nested  in  this  way,  then  one  can  use  information  criteria  like  AIC  and 
SIC,  as  discussed  in  Section  5.2.1  (p.  279),  where  the  number  of  parameters 
of  an  ARMA(p,  q)  model  (with  constant  term)  is  k  =  p  +  q  +  1.  One  then 
chooses  the  model  that  minimizes  AIC  or  SIC. 

Example  7.9:  Simulated  Time  Series  (continued) 

To  illustrate  step  2  of  the  modelling  process,  we  use  the  S(P)ACF  of  the  six 
simulated  time  series  of  Examples  7.3  and  7.4  to  identify  the  model  orders. 
The  SACF  and  SPACF  of  these  six  time  series  are  given  in  Panels  1  and  2  of 
Exhibit 7.5. The simulated time series have length $n = 500$, so that the
approximate 5 per cent critical value for S(P)ACF is $2/\sqrt{500} = 0.089$. For
the white noise series, some of the correlations (for instance, $r_3$ and
$\hat{\phi}_{33}$) are marginally significant, but none of the correlations is far above
0.089. The S(P)ACF indeed suggests that the series is white noise. For the AR(1)
process only the first SPACF is highly significant, and for the AR(2) process only
the first and second SPACF are highly significant. So the AR processes are well
identified by the SPACF. Similar results hold true for the SACF of the MA(1)
and MA(2) process. For the ARMA(1,1) process many of the S(P)ACF are
significant,  so  that  this  series  is  not  well  described  by  (low  order)  AR  or  MA 
models.  This  indicates  that  the  model  is  of  mixed  ARMA  type. 

Example  7.10:  Industrial  Production  (continued) 

We  continue  our  analysis  of  the  quarterly  series  of  annual  growth  rates  in  US 
industrial  production  (see  also  Examples  7.7  and  7.8).  We  now  consider  step 
2  of  the  modelling  process  of  Section  7.2.1  in  more  detail  for  this  series.  We 
will  discuss  (i)  the  sample  (partial)  autocorrelations  of  the  series  and  (ii)  a 
comparison  of  two  models:  AR(2)  and  ARMA(2,5). 


(i) Sample (partial) autocorrelations

Exhibit 7.8 contains the first twelve S(P)ACF of the quarterly series of yearly
growth rates in US industrial production. The quarterly data are considered
over the years 1961-94, giving $n = 136$ observations. Therefore the standard
error of the SPACF is approximately $1/\sqrt{n} = 0.086$, and SPACFs are
significant if they are (in absolute value) larger than 0.172. The SACF in
Exhibit 7.8 displays a somewhat cyclical pattern, and the SPACF suggests that an
AR(2) model is a good starting point because the SPACFs for lags 3-12 are
relatively small. Among these, only the eleventh SPACF is significant, but this
has no intuitive meaning and may be due to random effects. Note that, at 5 per
cent significance level, on average one out of twenty sample correlations may be
significant if the theoretical correlations are zero.
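
The significance check on the SPACF values can be reproduced directly from the numbers in Exhibit 7.8. The following sketch only applies the approximate bound $2/\sqrt{n}$ with $n = 136$; the variable names are illustrative.

```python
import math

n = 136
# SPACF values for lags 1 to 12, copied from Exhibit 7.8.
spacf = [0.851, -0.466, -0.119, -0.083, 0.117, -0.134,
         -0.104, -0.030, 0.159, -0.098, -0.186, 0.062]

se = 1.0 / math.sqrt(n)  # approximate standard error of the SPACF
bound = 2.0 * se         # approximate 5 per cent critical value
significant_lags = [k for k, v in enumerate(spacf, start=1) if abs(v) > bound]
print(f"s.e. = {se:.3f}; significant lags: {significant_lags}")
```

Apart from lags 1 and 2, which motivate the AR(2) specification, only lag 11 exceeds the bound.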

(ii)  Comparison  of  two  models:  AR(2)  and  ARMA(2,5) 

The AR(2) model was estimated in Example 7.8. As an alternative we consider
an ARMA(2,5) model for these data, so that $\phi(L)\Delta_4 y_t = \alpha + \theta(L)\varepsilon_t$,


Panel 1: Dependent Variable: D4Y
Method: Least Squares
Sample: 1961:1 1994:4
Included observations: 136
Convergence achieved after 23 iterations
Backcast: 1959:4 1960:4

Variable     Coefficient   Std. Error   t-Statistic   Prob.
C               0.001657     0.000607      2.729598   0.0072
D4Y(-1)         1.395914     0.136396     10.23425    0.0000
D4Y(-2)        -0.460286     0.126571     -3.636585   0.0004
MA(1)          -0.076816     0.125279     -0.613155   0.5409
MA(2)           0.054674     0.037878      1.443424   0.1513
MA(3)          -0.049391     0.035496     -1.391466   0.1665
MA(4)          -0.867788     0.037032    -23.43334    0.0000
MA(5)          -0.022209     0.115754     -0.191859   0.8482

R-squared             0.880726    Mean dependent var       0.032213
Adjusted R-squared    0.874203    S.D. dependent var       0.049219
S.E. of regression    0.017457    Akaike info criterion   -5.201129
Sum squared resid     0.039008    Schwarz criterion       -5.029796
Log likelihood      361.6768      F-statistic            135.0226
Durbin-Watson stat    1.979721    Prob(F-statistic)        0.000000

Panel 2: Wald Test
Null Hypothesis: C(4) = 0, C(5) = 0, C(6) = 0, C(7) = 0, C(8) = 0
F-statistic   12056652    Probability   0.000000
Chi-square    60283262    Probability   0.000000

Exhibit 7.10 Industrial Production (Example 7.10)

ARMA(2,5) model for the quarterly series of yearly growth rates of US industrial
production (Panel 1; the two lagged values of D4Y before 1961.1 are available and the
five lagged values of the error terms before 1961.1 are 'backcasted' from the model) and
test on the joint significance of the 5 MA terms (Panel 2).
where the AR polynomial $\phi(z)$ has degree 2 and the MA polynomial $\theta(z)$ has
degree 5. Panel 1 of Exhibit 7.10 shows the ML estimates of this model. If we
compare the AIC and SIC of the ARMA(2,5) model in Panel 1 of Exhibit
7.10 with the AIC and SIC of the AR(2) model in Exhibit 7.9, we see that
both these selection criteria favour the ARMA(2,5) model. As the AR(2)
model is a restriction of the ARMA(2,5) model, we can test the null
hypothesis of an AR(2) model (the restricted model) against the alternative of an
ARMA(2,5) model. Panel 2 of Exhibit 7.10 shows the outcome of the F-test
on the joint significance of the five MA terms. This test has the $F(g, n - k)$
distribution, where $g = 5$ is the number of restrictions (the five MA terms),
$k = 8$ is the number of parameters of the ARMA(2,5) model (including a
constant term), and $n = 4 \cdot 34 = 136$ is the number of observations (note
that pre-sample values before 1961 are available; see also our remarks on this
point in Example 7.8). The test shows that the five MA terms are jointly
significant. If we use the log-likelihood values in Exhibit 7.9 and Panel 1 of
Exhibit 7.10, the LR-test gives $LR = 2(361.68 - 334.22) = 54.92$ with
P-value (obtained from the $\chi^2(5)$ distribution) equal to $P = 0.0000$. So this
leads to the same conclusion, that is, the ARMA(2,5) model is preferred
above the AR(2) model. However, as we shall see in Example 7.11, the
ARMA(2,5) model corresponds to over-fitting, which leads to worse
performance in prediction as compared to the AR(2) model.
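
The information criteria and the LR-test of this example can be recomputed from the log-likelihood values in Exhibits 7.9 and 7.10. The sketch below assumes the per-observation forms $-2\ell/n + 2k/n$ and $-2\ell/n + k\log(n)/n$, which reproduce the values printed in the exhibits and may differ by a constant from definitions based on the log residual variance. The closed-form tail probability is specific to the $\chi^2$ distribution with 5 degrees of freedom; all function names are hypothetical.

```python
import math


def aic_sic(loglik, n, k):
    """AIC and SIC in the per-observation form that matches Exhibits 7.9 and 7.10."""
    aic = -2.0 * loglik / n + 2.0 * k / n
    sic = -2.0 * loglik / n + k * math.log(n) / n
    return aic, sic


def chi2_sf_df5(x):
    """Upper tail probability of the chi-squared distribution with 5 d.f."""
    return (math.erfc(math.sqrt(x / 2.0))
            + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0) * (1.0 + x / 3.0))


n = 136
loglik_ar2, k_ar2 = 334.2160, 3    # Exhibit 7.9
loglik_arma, k_arma = 361.6768, 8  # Exhibit 7.10, Panel 1

aic_ar2, sic_ar2 = aic_sic(loglik_ar2, n, k_ar2)
aic_arma, sic_arma = aic_sic(loglik_arma, n, k_arma)
lr = 2.0 * (loglik_arma - loglik_ar2)
p_value = chi2_sf_df5(lr)

print(f"AR(2):     AIC = {aic_ar2:.6f}, SIC = {sic_ar2:.6f}")
print(f"ARMA(2,5): AIC = {aic_arma:.6f}, SIC = {sic_arma:.6f}")
print(f"LR = {lr:.2f}, P-value = {p_value:.2e}")
```

Both criteria are smaller for the ARMA(2,5) model, and the LR statistic lies far in the tail of the $\chi^2(5)$ distribution, in line with the conclusions drawn in the example.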


7.2.4  Diagnostic  tests 

Overview 

Once  the  model  orders  have  been  selected  and  the  parameters  of  the  model 
have  been  estimated,  the  resulting  model  should  be  tested  to  see  whether  the 
model  is  correctly  specified.  Here  we  discuss  some  of  the  main  diagnostic  tools 
for  time  series  models.  Some  of  these  tools  are  based  on  the  diagnostic  tests  for 
regression  models  discussed  in  Chapter  5,  in  particular  tests  based  on  the 
model  residuals  and  tests  based  on  the  predictive  performance  of  the  model. 
The  discussion  of  some  specific  time  series  tests  is  postponed  till  later  sections. 
In  Section  7.3.3  we  consider  tests  for  the  presence  and  nature  of  trends,  and  in 
Section  7.3.4  we  discuss  tests  for  outliers  in  time  series  and  for  time  varying 
variance. 

Check  on  stationarity 

As  a  first  step  one  should  check  that  the  modelled  time  series  is  stationary,  as 
all  tests  discussed  below  are  based  on  this  assumption.  A  time  plot  of  the 
series  is  useful  to  see  whether  the  mean  level  and  the  variance  are  more  or  less 
stable over time. One can also split the time series into two parts and check
whether  the  mean,  variance,  and  autocovariances  are  comparable  in  the  two 
periods.  Many  time  series  in  business  and  economics  show  changes  in  levels 
due  to  trends  and  seasonals.  Such  aspects  should  then  first  be  modelled  by 
methods  to  be  discussed  in  Section  7.3. 


Graphical  inspection  of  residuals 

The model selection methods of Section 7.2.3 and the estimation methods in
Section 7.2.2 are based on the assumption that the error terms of the ARMA
model satisfy the standard assumption that $\varepsilon_t \sim NID(0, \sigma^2)$. That is, the
error terms are assumed to have constant mean zero and constant variance,
and they are uncorrelated and normally distributed.

It  is  always  helpful  to  use  graphical  tools  as  a  first  step  to  analyse  the 
residuals,  as  this  may  indicate  possible  defects  of  the  model.  The  time  plot  of 
the  residuals  shows  the  mean  and  variance  over  time,  the  correlogram  can  be 
used  to  check  for  residual  correlation,  and  the  histogram  of  the  residuals  can 
be  compared  with  the  normal  distribution. 


Check  for  serial  correlation 

As  the  time  series  model  tries  to  capture  the  correlations  over  time,  it  is  of 
particular  importance  to  test  for  serial  correlation  of  the  model  residuals 
$e_t$, $t = 1, \ldots, n$. The residual autocorrelations $r_k(e)$, for $k \geq 1$, are given by

$r_k(e) = \frac{\sum_{t=k+1}^{n} e_t e_{t-k}}{\sum_{t=1}^{n} e_t^2}.$

In a correctly specified model the parameters are estimated consistently and
the residuals $e_t$ converge to the innovations $\varepsilon_t$. Asymptotically, the residuals
are then uncorrelated and $r_k(e)$ has mean zero and variance $1/n$. The
autocorrelations are not significant (at approximate 5 per cent significance level)
if they are within the interval

$-\frac{2}{\sqrt{n}} < r_k(e) < \frac{2}{\sqrt{n}}.$


Serial  correlation  test  of  Ljung-Box 

The joint significance of the first m residual autocorrelations can be tested by
the Ljung-Box test of Section 5.5.3 (p. 365), that is,

$LB = n\sum_{k=1}^{m} \frac{n+2}{n-k}\, r_k^2(e) \approx \chi^2(m - p - q).$

Under the null hypothesis that the estimated ARMA(p, q) model is correctly
specified, this test asymptotically follows the $\chi^2(m - p - q)$ distribution.
Note that, in contrast with this test for the regression model in Section
5.5.3, now $(p + q)$ degrees of freedom are lost. This is because the
ARMA(p, q) model has $(p + q)$ parameters that affect the serial correlation
of the model residuals.
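
The Ljung-Box statistic is straightforward to compute from a list of residuals. The sketch below returns the statistic together with its degrees of freedom $m - p - q$; the function name `ljung_box` is hypothetical, the residuals are used without subtracting their mean (as in the formula for $r_k(e)$ above), and the P-value still has to be looked up in the $\chi^2$ distribution.

```python
def ljung_box(resid, m, p=0, q=0):
    """Ljung-Box statistic for the first m residual autocorrelations.

    Returns (LB, degrees of freedom), with m - p - q degrees of freedom
    for the residuals of an ARMA(p, q) model.
    """
    n = len(resid)
    denom = sum(e * e for e in resid)
    total = 0.0
    for k in range(1, m + 1):
        r_k = sum(resid[t] * resid[t - k] for t in range(k, n)) / denom
        total += (n + 2) / (n - k) * r_k ** 2
    return n * total, m - p - q


# Tiny worked case: r_1 = -0.75, so LB = 4 * (6/3) * 0.5625 = 4.5.
lb, df = ljung_box([1.0, -1.0, 1.0, -1.0], m=1)
print(lb, df)
```

The degrees of freedom must be positive, so m should be chosen larger than p + q.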

Serial  correlation  test  of  Breusch-Godfrey 

Another  useful  test  for  residual  autocorrelation  is  obtained  by  applying  the 
Breusch-Godfrey  LM- test  for  serial  correlation  in  regression  models,  de¬ 
scribed  in  Section  5.5.3  (p.  364).  If  the  estimated  model  is  an  AR (p)  model 
with  residuals  et,  then  the  test  for  serial  correlation  is  based  on  a  regression  of 
the  type 


et  —  a.  +  Piyt-i  +  ■  ■  ■  +  Ppyt-p  +  +  ■  ■  ■  +  yr  et~r  +  uy. 

Here  r  is  chosen  to  incorporate  possibly  relevant  correlations  up  to  lag  r.  The 
LM- test  is  given  by  LM  =  nR 2  of  this  regression,  which  is  asymptotically 
distributed  as  y2{r)  under  the  null  hypothesis  that  the  AR (p)  model  is  correct. 
It  is  also  possible  to  use  the  F- test  on  the  joint  significance  of  the  parameters 
7i,  •  •  •  ,yr,  and  this  is  asymptotically  equivalent  to  the  LM- test.  In  a  similar 
way  one  can  test  for  residual  autocorrelation  in  ARMA(p,  q)  models  by 
adding  lagged  values  of  the  residuals  et  as  explanatory  variables.  For  in¬ 
stance,  for  an  ARMA(1,1)  model  the  test  equation  for  second  order  residual 
autocorrelation  becomes 


et  —  a  +  fiiyt-\  +  y\et-\  +  y^t-i  +  <-ot  +  1- 


This auxiliary equation corresponds to the general principle of the
Breusch-Godfrey test in Section 5.5.3 to add lagged residuals to the model equation
under the null hypothesis. Here the ‘regressors’ $y_{t-1}$ and $\omega_{t-1}$ correspond to
the chosen ARMA(1,1) model, and the lagged ARMA(1,1) residuals $e_{t-1}$ and
$e_{t-2}$ are the added regressors to test for the presence of serial correlation. To
perform the Breusch-Godfrey test for ARMA models, one first estimates
the postulated ARMA(p, q) model by ML, with residuals $e_t$, and then estimates
the test equation (again by ML, because of the presence of MA terms,
with corresponding fitted values $\hat{e}_t$). Then
$LM = nR^2 = n(SSE/SST) = n\left(\sum \hat{e}_t^2 / \sum e_t^2\right)$ of the test equation.
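The AR(p) version of the test can be sketched in a few lines. The following is only an illustration under stated assumptions: the function names (`ols_residuals`, `lag_matrix`, `breusch_godfrey_ar`, `simulate_ar2`) are hypothetical, the data are simulated, the AR(p) model is estimated by OLS, the first r residuals are dropped in the auxiliary regression (software packages may instead set pre-sample residuals to zero), and $R^2$ is computed around the mean of the dependent variable.

```python
import numpy as np

def ols_residuals(X, y):
    """Residuals of the OLS regression of y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

def lag_matrix(x, lags, start):
    """Columns x[t-1], ..., x[t-lags] for t = start, ..., len(x)-1."""
    return np.column_stack([x[start - k:len(x) - k] for k in range(1, lags + 1)])

def breusch_godfrey_ar(y, p, r):
    """LM = n * R^2 from regressing e_t on 1, y_{t-1..t-p}, e_{t-1..t-r}."""
    y = np.asarray(y, dtype=float)
    # Step 1: estimate the AR(p) model by OLS and keep the residuals e_t.
    X = np.column_stack([np.ones(len(y) - p), lag_matrix(y, p, p)])
    e = ols_residuals(X, y[p:])
    # Step 2: auxiliary regression on the original regressors plus r lagged
    # residuals (assumption: the first r residuals are simply dropped).
    Z = np.column_stack([X[r:], lag_matrix(e, r, r)])
    dep = e[r:]
    u = ols_residuals(Z, dep)
    r2 = 1.0 - (u @ u) / np.sum((dep - dep.mean()) ** 2)
    return len(dep) * r2

def simulate_ar2(n, phi1, phi2, seed):
    """Simulated AR(2) series with standard normal innovations."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n + 100)
    y = np.zeros(n + 100)
    for t in range(2, n + 100):
        y[t] = phi1 * y[t - 1] + phi2 * y[t - 2] + eps[t]
    return y[100:]  # drop the burn-in period

y = simulate_ar2(500, 0.5, -0.5, seed=0)
lm_too_small = breusch_godfrey_ar(y, p=1, r=2)  # AR(1) omits the second lag
lm_adequate = breusch_godfrey_ar(y, p=2, r=2)   # AR(2) matches the simulated data
print(round(lm_too_small, 1), round(lm_adequate, 1))
```

Each statistic is to be compared with a $\chi^2(2)$ critical value. The misspecified AR(1) fit should give a much larger value than the AR(2) fit.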

Forecast  performance  on  a  hold-out  set  of  data 

It is always of interest to compare alternative models by their forecast
performance. In many situations the main purpose of a time series model is
to produce out-of-sample forecasts. To simulate this situation, one should
leave out a part of the available observations, as was discussed in Section
5.2.1 (p. 280). Suppose that in total (m + n) observations are available and
that the last m observations are left out for evaluation purposes. Models are
then identified, estimated, and tested on the basis of only the first n observations
$y_t$, $t = 1, \ldots, n$, and forecasts are produced for the time moments
$t = n + 1, \ldots, n + m$. These forecasts can be made in two ways. One method
is to predict $y_{n+h}$ h-steps-ahead (‘dynamic’ forecasts), using only the observations
until $t = n$ in producing the forecasts. Another method is to predict
$y_{n+h}$ one-step-ahead (‘static’ forecasts), using the observations until
$t = n + h - 1$ in forecasting $y_{n+h}$. Here we will consider the case of dynamic
forecasts, as this is more relevant in actually predicting the future multiple
steps ahead.
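The difference between the two forecasting schemes can be made concrete with a small sketch for an AR(p) model whose coefficients are treated as given. The function names and the numbers below are hypothetical and serve only as an illustration.

```python
def dynamic_forecasts(history, const, phi, m):
    """h-step-ahead forecasts for h = 1..m, using only data up to t = n."""
    path = list(history)
    out = []
    for _ in range(m):
        f = const + sum(c * path[-k] for k, c in enumerate(phi, start=1))
        out.append(f)
        path.append(f)  # feed the forecast itself back into the recursion
    return out

def static_forecasts(history, holdout, const, phi):
    """One-step-ahead forecasts, each using actual data up to t = n + h - 1."""
    path = list(history)
    out = []
    for actual in holdout:
        out.append(const + sum(c * path[-k] for k, c in enumerate(phi, start=1)))
        path.append(actual)  # feed the realized value back into the recursion
    return out

# Hypothetical AR(1) example: y_t = 1 + 0.5 y_{t-1} + eps_t, with mean 2.
history = [4.0]
holdout = [2.0, 3.0, 1.0]
dyn = dynamic_forecasts(history, const=1.0, phi=[0.5], m=3)
sta = static_forecasts(history, holdout, const=1.0, phi=[0.5])
print(dyn)  # [3.0, 2.5, 2.25]
print(sta)  # [3.0, 2.0, 2.5]
```

The dynamic forecasts move steadily towards the process mean, whereas the static forecasts are pulled around by each newly observed value.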


Forecast  evaluation  criteria 

The forecast performance may be checked, for instance, by the percentage of
the m observations in the hold-out sample that are within the 95 per cent
forecast intervals of the model. As was discussed in Section 5.2.1, different
models may, for instance, be compared by their root mean squared prediction
error (RMSE) and their mean absolute prediction error (MAE), which are
defined by

$$\mathrm{RMSE} = \left( \frac{1}{m} \sum_{h=1}^{m} (y_{n+h} - \hat{y}_{n+h})^2 \right)^{1/2}, \qquad \mathrm{MAE} = \frac{1}{m} \sum_{h=1}^{m} |y_{n+h} - \hat{y}_{n+h}|.$$

Two competing models can also be compared by the number of times B that
the absolute error $|y_{n+h} - \hat{y}_{n+h}|$ of the first model is smaller than that of the
second model. If the models forecast equally well, then B has the binomial
distribution with m trials and with chance of success equal to $\frac{1}{2}$. The first
model is preferred if B is significantly larger than $\frac{m}{2}$, and the second model is
better if B is significantly smaller than $\frac{m}{2}$. One can test the hypothesis that both
models forecast equally well (that is, that this chance is $\frac{1}{2}$) by means of the
binomial distribution. If m is large enough, then $(B - \frac{m}{2})$ is approximately
normally distributed with mean zero and variance $\frac{m}{4}$. That is, under the null
hypothesis of equal forecast quality

$$\frac{1}{\sqrt{m}} (2B - m) \approx N(0, 1).$$

For instance, one may choose the first model if $(2B - m)/\sqrt{m} > 2$ and
the second model if $(2B - m)/\sqrt{m} < -2$.
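The three criteria are easy to compute once the hold-out values and the forecasts are available. The sketch below uses made-up numbers and hypothetical function names. Ties in absolute error are counted against the first model, which is an assumption of this illustration.

```python
import math

def rmse(actual, forecast):
    """Root mean squared prediction error over the hold-out sample."""
    m = len(actual)
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / m)

def mae(actual, forecast):
    """Mean absolute prediction error over the hold-out sample."""
    m = len(actual)
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / m

def sign_test(actual, forecast1, forecast2):
    """Return B and the approximately N(0,1) statistic (2B - m)/sqrt(m)."""
    m = len(actual)
    b = sum(abs(a - f1) < abs(a - f2)
            for a, f1, f2 in zip(actual, forecast1, forecast2))
    return b, (2 * b - m) / math.sqrt(m)

# Made-up hold-out values and forecasts of two competing models.
actual = [1.0, 2.0, 3.0, 4.0]
model1 = [1.0, 2.0, 3.0, 6.0]   # exact three times, one large miss
model2 = [2.0, 3.0, 4.0, 5.0]   # always off by one

print(rmse(actual, model1), rmse(actual, model2))  # 1.0 1.0
print(mae(actual, model1), mae(actual, model2))    # 0.5 1.0
b, stat = sign_test(actual, model1, model2)
print(b, stat)                                     # 3 1.0
```

In this example the two models have the same RMSE but different MAE, because the RMSE penalizes the single large miss of the first model more heavily. With only four observations the normal approximation is of course not reliable; the numbers only show the mechanics.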
Summary of diagnostic tests

Summarizing,  in  time  series  analysis  the  following  diagnostic  tests  are  useful 
to  check  the  empirical  validity  of  the  model.  The  model  should  be  adjusted  if 
some  of  the  tests  lead  to  rejection  of  the  null  hypothesis  that  the  time  series 
model  is  correctly  specified.  If  several  models  pass  the  diagnostic  tests,  then  a 
choice  can  be  based  on  the  forecast  performance  of  the  models. 

•  Test  the  stationarity  of  the  time  series  (time  plot  and  correlogram  of  the 
series,  and  tests  described  in  Sections  7.3.3  and  7.3.4). 

•  Test  for  outliers  and  constant  variance  (time  plot  and  histogram  of  the 
series  and  of  the  model  residuals,  Jarque-Bera  test  and  Breusch-Pagan  test 
on  the  model  residuals;  further  tests  described  in  Section  7.4). 

•  Test the lag structure of the ARMA model (S(P)ACF, t- and F-tests on
additional lags, AIC and SIC).

•  Test  for  residual  autocorrelation  (SACF,  Ljung-Box  test,  Breusch-Godfrey 
test  on  model  residuals;  note  that  the  Durbin-Watson  test  should  not  be 
used,  as  the  regressors  are  stochastic  in  a  time  series  model). 

•  Evaluate  the  forecast  performance  (especially  dynamic  forecasts  on  a 
hold-out  sample),  and  compare  this  performance  between  competing 
models. 

We  shall  now  illustrate  this  by  means  of  an  example.  As  most  time  series  in 
business  and  economics  are  characterized  by  trends  and  seasonal  effects, 
further  empirical  applications  of  the  methods  treated  in  Section  7.2  (that 
are  valid  for  stationary  time  series)  are  postponed  until  Sections  7.3.3 
and  7.3.4. 


Example 7.11: Industrial Production (continued)

We  continue  our  analysis  of  the  quarterly  data  of  yearly  growth  rates  of 
industrial  production  in  the  USA.  In  terms  of  the  modelling  steps  described  in 
Section  7.2.1,  steps  1-3  were  discussed  in  previous  examples  (see  Examples 
7.7,  7.8,  and  7.10).  Now  we  consider  steps  4-6  and  discuss  (i)  diagnostic 
tests  for  the  AR(2)  model  of  Example  7.8,  (ii)  diagnostic  tests  for  two 
alternative  models,  ARMA(2,5)  and  AR(5),  (iii)  the  forecast  performance 
of  the  AR(2)  model,  (iv)  a  remark  on  the  computed  forecast  intervals,  and  (v) 
a  comparison  of  the  forecast  quality  of  the  three  considered  models. 

(i)  Diagnostic  tests  for  the  AR(2)  model 

First we perform diagnostic tests on the AR(2) model that was our first
guess in Example 7.7 (see also Example 7.8). Exhibits 7.11 (a) and (b)
show the time plot and the histogram of the residuals of this model. The
assumption of normally distributed error terms is clearly rejected by the
Jarque-Bera test. The residuals have excess kurtosis (around 5.9, as compared
with 3 for the normal distribution), which is due to some outlying
observations. The influence of outliers is further discussed in Example 7.17.
Panel 3 of Exhibit 7.11 shows the correlogram of the residuals. There are
n = 136 observations, so that correlations are significant (at 5 per cent
significance level) if they are larger than $2/\sqrt{136} = 0.172$ in absolute
value. There is some evidence of residual correlation at lags 3, 4, and 8.
The Ljung-Box test has P-values close to 0.05 if eight or more lags are
included. The Breusch-Godfrey test for serial correlation, with four lags
included, has P-value 0.06 (see Panel 4).
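The numbers in this paragraph can be checked approximately from the rounded SACF values printed in Panel 3 of Exhibit 7.11. The sketch below assumes the usual Ljung-Box form $Q(m) = n(n+2) \sum_{k=1}^{m} r_k^2/(n-k)$, which is not restated in this passage. Because the SACF values are rounded to three decimals, the computed statistics agree with the reported ones only up to rounding.

```python
import math

n = 136
# SACF of the AR(2) residuals at lags 1..12, as printed (rounded) in Panel 3.
sacf = [-0.064, 0.010, 0.172, -0.191, 0.013, 0.066,
        0.023, -0.194, -0.053, 0.116, -0.106, 0.029]

# Rule-of-thumb 5 per cent bound 2/sqrt(n) for an individual autocorrelation.
bound = 2 / math.sqrt(n)
flagged = [k for k, r in enumerate(sacf, start=1) if abs(r) > bound]

# Cumulative Ljung-Box statistics Q(m) for m = 1..12 (assumed formula, see above).
ljung_box = []
running = 0.0
for k, r in enumerate(sacf, start=1):
    running += r * r / (n - k)
    ljung_box.append(n * (n + 2) * running)

print(flagged)
print([round(q, 2) for q in ljung_box])
```

The flagged lags are those whose sample autocorrelation exceeds the bound in absolute value, and the cumulative statistics can be set next to the LB-Statistic column of Panel 3.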

(ii)  Diagnostic  tests  for  the  ARMA(2,5)  and  AR(5)  models 

Because the residuals of the AR(2) model have some significant autocorrelations
(for instance at lag 4, corresponding to a lag of one year), it is worthwhile
comparing this model with other models that allow for a richer
correlation pattern in the time series. This is achieved by adding extra lags
to the model. We consider the ARMA(2,5) model of Example 7.10 (with
parameter estimates in Panel 1 of Exhibit 7.10) and the AR(5) model (with
parameter estimates given in Panel 5 of Exhibit 7.11). Panel 6 of Exhibit 7.11
summarizes the outcomes of diagnostic tests for the three models, that is,
AR(2), ARMA(2,5), and AR(5). Of these three models, the ARMA(2,5)
model is preferred by the selection criteria AIC and SIC. However, normality
of the residuals is rejected for all models, and the AR(5) model performs
relatively best in this respect. The non-normality is due to outliers, as will be
further investigated in Section 7.4.1.

(iii)  Forecast  performance  of  the  AR(2)  model 

Next we consider the forecast performance of the AR(2) model estimated in
Example 7.8 (see (7.17)). This model was estimated with the data over the
years 1961-94, and now we will forecast the series over the period 1995.1-1998.3
(this period contains fifteen quarters). First we consider the quality of
the model in forecasting the growth rate in the next quarter. Exhibit 7.12 (a)
shows the fifteen one-step-ahead point forecasts and corresponding 95 per
cent forecast intervals of $\Delta_4 y_t$ for 1995.1 to 1998.3, together with the
actually realized growth rates. These forecasts are quite accurate and the
95 per cent forecast intervals include all the actual values. Next we consider
multi-step-ahead forecasts of the growth rate, ranging from one quarter
ahead (for 1995.1) to fifteen quarters ahead (for 1998.3). Exhibit 7.12 (b)
shows these fifteen h-step-ahead forecasts of $\Delta_4 y_t$. The forecasts converge to
the mean value of 3.34 per cent for increasing values of the horizon h. For
longer horizons the 95 per cent forecast intervals become very wide and even
include substantial negative growth rates.
[Graphs (a) and (b) of Exhibit 7.11 are not reproduced here; legend of the plotted series: RESIDAR2, P2SE, M2SE.]


Panel 3: Correlogram RESIDAR2
Sample: 1961:1 1994:4; Included observations: 136

Lag     SACF     LB-Statistic    P-value
  1    -0.064       0.5700        0.450
  2     0.010       0.5850        0.746
  3     0.172       4.7681        0.190
  4    -0.191       9.9534        0.041
  5     0.013       9.9777        0.076
  6     0.066      10.599         0.102
  7     0.023      10.676         0.153
  8    -0.194      16.181         0.040
  9    -0.053      16.596         0.055
 10     0.116      18.613         0.045
 11    -0.106      20.301         0.041
 12     0.029      20.426         0.059

Panel 4: Breusch-Godfrey Serial Correlation LM Test (4 lags included)

F-statistic      2.289063    Probability    0.063290
Obs*R-squared    9.013347    Probability    0.060767

Test Equation:
Dependent Variable: RESIDAR2
Method: Least Squares

Variable      Coefficient    Std. Error    t-Statistic    Prob.
C              -0.001508      0.005028      -0.299866     0.7648
D4Y(-1)         0.152125      0.292653       0.519812     0.6041
D4Y(-2)        -0.106304      0.183270      -0.580037     0.5629
RESID(-1)      -0.191041      0.308317      -0.619624     0.5366
RESID(-2)      -0.079094      0.258568      -0.305893     0.7602
RESID(-3)       0.116781      0.195341       0.597830     0.5510
RESID(-4)      -0.181992      0.140802      -1.292540     0.1985

Exhibit 7.11 Industrial Production (Example 7.11)

Diagnostic tests on the residuals of the AR(2) model for the series D4Y of yearly growth rates
of US industrial production, time plot (a), histogram (b), correlogram and Ljung-Box test
(Panel 3), and Breusch-Godfrey test (Panel 4).




Panel 5: Dependent Variable: D4Y
Method: Least Squares
Sample: 1961:1 1994:4; Included observations: 136

Variable     Coefficient    Std. Error    t-Statistic    Prob.
C              0.006444      0.002295       2.807883     0.0058
D4Y(-1)        1.360621      0.082194      16.55380      0.0000
D4Y(-2)       -0.650037      0.130812      -4.969235     0.0000
D4Y(-3)        0.365532      0.131069       2.788851     0.0061
D4Y(-4)       -0.536124      0.121923      -4.397234     0.0000
D4Y(-5)        0.269831      0.078028       3.458135     0.0007

R-squared              0.844584    Mean dependent var        0.032213
Adjusted R-squared     0.838607    S.D. dependent var        0.049219
S.E. of regression     0.019773    Akaike info criterion    -4.965862
Sum squared resid      0.050827    Schwarz criterion        -4.837363
Log likelihood         343.6786    F-statistic               141.2934
Durbin-Watson stat     1.902114    Prob(F-statistic)         0.000000

Panel 6: Overview of diagnostic tests

Criterion     Diagnostic test              AR(2)        AR(5)        ARMA(2,5)
Model fit     R-squared                    0.821        0.845        0.881
              St.Dev. residuals            0.021        0.020        0.017
Sel. Crit.    AIC                         -4.871       -4.966       -5.201
              SIC                         -4.807       -4.837       -5.030
Normality     Skewness                    -0.108       -0.300       -1.007
              Kurtosis                     5.896        4.936        5.738
              Jarque-Bera                 47.784       23.280       65.490
Ser. Corr.    Ljung-Box (12 lags)          P = 0.059    P = 0.055    P = 0.113
              Breusch-Godfrey (4 lags)     P = 0.063    P = 0.025    P = 0.004
RMSE          one-step (95.1-98.3)         0.0085       0.0076       0.0093
              multi-step (95.1-98.3)       0.0167       0.0152       0.0251

Exhibit  7.11  (Contd.) 

AR(5)  model  for  the  series  D4Y  of  yearly  growth  rates  of  US  industrial  production  (Panel  5) 
and  overview  of  diagnostic  tests  for  the  AR(2),  AR(5),  and  ARMA(2,5)  models  (Panel  6). 


The model can also be used to forecast the ‘levels’ $y_t$ (in logarithms) instead
of the growth rates of industrial production. Exhibits 7.12 (c) and
(d) show respectively the one-step (static) and multi-step (dynamic)
forecasts of the series $y_t$ obtained by this model. This shows that the AR(2)
model provides relatively good short-term forecasts (up to six quarters) but
that the model may be less useful in long-term forecasting as the uncertainty
becomes very large. For instance, fifteen quarters (around four years) ahead,
Exhibit 7.12 (d) shows that the 95 per cent forecast interval of $y_t$ (in
logarithms) has a width of 0.4. This corresponds to a factor of $e^{0.4} \approx 1.5$ in
the actual level of industrial production, that is, an uncertainty of around
50 per cent.
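
As a short check of this arithmetic, suppose the 95 per cent interval for the logarithmic level runs from $\ell$ to $u$ with $u - \ell = 0.4$ (the symbols $\ell$ and $u$ are introduced here only for illustration). The ratio of the upper to the lower bound of the level itself is then

$$\frac{e^{u}}{e^{\ell}} = e^{u - \ell} = e^{0.4} \approx 1.49,$$

so the upper bound lies roughly 50 per cent above the lower bound.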
[Graphs of Exhibit 7.12 are not reproduced here; see the caption below.]


Exhibit  7.12  Industrial  Production  (Example  7.11) 


One-step-ahead  forecasts  (a)  and  multi-step-ahead  forecasts  (b)  of  the  yearly  growth  rates  of 
US  industrial  production  (D4Y)  generated  by  the  AR(2)  model,  together  with  the  95%  forecast 
intervals  and  the  actually  realized  growth  rates.  Further,  implied  forecasts  of  the  logarithmic 
levels  (Y)  of  US  industrial  production:  static  forecasts  (one-step-ahead  (c))  and  dynamic 
forecasts  (multi-step-ahead  (d)),  together  with  the  95%  forecast  intervals  and  the  actual 
values  of  Y. 

(iv)  Remark  on  the  computed  forecast  intervals 

The forecast intervals for $\Delta_4 y_t$ are computed as discussed in Section 7.1.6, by
substituting the estimated values of the AR(2) model. These parameter
estimates are themselves uncertain, but this is not taken into account in
constructing the uncertainty bounds of the forecasts. For the case of regression
models, the effect of parameter uncertainty on prediction intervals was
discussed in Section 3.4.3 (p. 171). For time series models this is more
complicated, as the regressors are themselves stochastic. Forecast intervals
can be estimated by simulation. However, in large samples the effect of
parameter uncertainty vanishes, and in practice one often neglects this effect.
The forecast intervals for $y_t$ are computed from those of $\Delta_4 y_t$.
(v) Comparison of forecast quality of three models

Finally we compare the out-of-sample forecast quality of the three models,
AR(2), AR(5), and ARMA(2,5). Panel 6 of Exhibit 7.11 reports the RMSE of
the three models for static and dynamic forecasts of the series $\Delta_4 y_t$ of growth
rates over the period 1995.1 to 1998.3. The AR(5) model gives the best
predictions. Note that this model did not perform best from the point of view of
within-sample residual diagnostics and that it was not selected by AIC and
SIC. This shows that within-sample criteria need not always give the best
model for out-of-sample purposes. It is therefore of importance to keep some
data out of the specification, estimation, and diagnostic phases for later
model selection purposes. This may in particular prevent the data from
being over-fitted by models that contain too many parameters. Such models
improve the fit over the estimation sample but provide worse forecasts. The
ARMA(2,5) model seems to suffer from this kind of over-fitting (see Panel 6
of Exhibit 7.11). The ARMA(2,5) model has the smallest in-sample residuals
(the standard deviation of the residuals is 0.017, which is smaller than that of
AR(2) and AR(5)), but it gives the worst forecasts (the RMSE of dynamic
forecasts is 0.0251, which is considerably larger than that of AR(2) and
AR(5)).

The overall conclusion is that the AR(5) model performs best in forecasting,
with the AR(2) model as a good alternative. The ARMA(2,5) model
seems to be somewhat less useful.
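
The contrast between within-sample selection and out-of-sample performance can be read off mechanically from the values in Panel 6 of Exhibit 7.11. The small sketch below only re-tabulates those reported numbers; the dictionary layout and variable names are illustrative choices.

```python
# Values copied from Panel 6 of Exhibit 7.11 (smaller is better for all four).
panel6 = {
    "AR(2)":     {"AIC": -4.871, "SIC": -4.807,
                  "RMSE one-step": 0.0085, "RMSE multi-step": 0.0167},
    "AR(5)":     {"AIC": -4.966, "SIC": -4.837,
                  "RMSE one-step": 0.0076, "RMSE multi-step": 0.0152},
    "ARMA(2,5)": {"AIC": -5.201, "SIC": -5.030,
                  "RMSE one-step": 0.0093, "RMSE multi-step": 0.0251},
}

criteria = ["AIC", "SIC", "RMSE one-step", "RMSE multi-step"]
# For each criterion, pick the model with the smallest value.
best = {c: min(panel6, key=lambda model: panel6[model][c]) for c in criteria}

for c in criteria:
    print(f"{c:16s} -> {best[c]}")
```

The in-sample criteria AIC and SIC select the ARMA(2,5) model, whereas both hold-out RMSE measures select the AR(5) model.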

Exercises: E: 7.20e


7.2.5  Summary 

In this section we have discussed a sequence of steps to obtain adequate
models for observed stationary time series.

•  The  modelling  starts  with  graphical  inspection  of  the  time  series  and 
possible  transformations  to  obtain  stationarity. 

•  Next  the  orders  of  an  ARMA(p,  q)  model  are  chosen,  with  the  help  of 
the  sample  (partial)  autocorrelations. 

•  If  the  chosen  model  is  purely  autoregressive,  then  it  can  be  estimated  by 
OLS,  and,  if  the  model  contains  MA  terms,  it  is  estimated  by  ML  (or  by 
NLS  or  other  asymptotically  equivalent  methods). 

•  One should check whether the estimated model is adequate. In particular,
it is of interest to test whether the model residuals are uncorrelated
and whether the model performs well in producing forecasts on a hold-out
sample.

•  If the model does not perform well enough, one can improve the model
by selecting other orders of the ARMA model. The outcomes of diagnostic
tests and model selection criteria help in finding better models.

•  Finally,  if  one  is  satisfied  with  the  obtained  model,  this  model  can  be 
used,  for  instance,  to  predict  future  values  of  the  time  series.  In  general, 
the  forecasts  will  perform  better  for  the  nearer  future  than  for  more 
distant  times. 
7.3 Trends and seasonals


Uses  Chapters  1-4;  Section  5.5;  parts  of  Section  5.3;  Sections  7.1,  7.2. 


7.3.1  Trend  models 

Deterministic  trends 

Many economic time series tend to grow over time, that is, they display
trending behaviour. This is the case, for example, for the level of industrial
production in Example 7.1 (see Exhibit 7.1 (a)) and for the Dow-Jones index
in Example 7.2 (see Exhibit 7.2 (a)). The logarithm of both series also
contains trends. Such time series do not satisfy the assumption of stationarity
that is required for the methods discussed in Sections 7.1 and 7.2. If the series
shows a more or less steady upward or downward trend, this can be modelled
by a deterministic trend. The simplest model is

$$y_t = \alpha + \beta t + \varepsilon_t, \qquad t = 1, \ldots, n. \qquad (7.18)$$

This corresponds to a linear trend. This model is clearly non-stationary (for
$\beta \neq 0$), because the mean $E[y_t] = \alpha + \beta t$ varies over time. The non-stationarity
may also be detected from the autocorrelation function. For $n \to \infty$ all
sample autocorrelations converge to one, and in finite samples the SACF
tends to zero very slowly (see Exercise 7.13). Other trend specifications may
also be of interest, for example, a quadratic trend with trend function
$f(t) = \alpha + \beta t + \gamma t^2$ or a trend with saturation such as $f(t) = \alpha + \beta t^{-1}$. The
above trend models can be extended by including (stationary) AR terms and
(invertible) MA terms. For instance, an ARMA model with linear deterministic
trend is described by

$$\phi(L) y_t = \alpha + \beta t + \theta(L) \varepsilon_t, \qquad (7.19)$$

where $\phi(z)$ and $\theta(z)$ have all their roots outside the unit circle.


Estimation  of  models  with  deterministic  trends 

In the models (7.18) and (7.19) the trend term t is non-stationary and the
stability condition on the regressors that was used in the asymptotic theory of
Chapter 4 is not satisfied. More precisely, Assumption 1* in Section 4.1.2
(p. 193) requires that the probability limit of $\frac{1}{n} X'X$ exists. However, for the
regressor $x_t = t$ we get $\frac{1}{n} \sum_{t=1}^{n} x_t^2 \to \infty$ for $n \to \infty$. The consequence is that
the OLS estimator of the slope coefficient $\beta$ is consistent at a speed that is
higher than the usual speed $\sqrt{n}$, namely, $n\sqrt{n}$ (see Exercise 4.2).

A simple two-step estimation method is the following. First estimate the
trend by regressing the time series $y_t$ on a constant and t, as in the model
(7.18), neglecting possible AR and MA terms. If
the trend is modelled correctly, the residuals are stationary and can be
modelled in the second step by ARMA models as discussed in Section 7.2.
Instead of this two-step approach, one can also directly estimate the ARMA
model (7.19) with the deterministic trend included. The asymptotic statistical
properties of the estimated ARMA parameters are the same as in the
stationary case discussed in Section 7.2.2, provided that the trend has been
modelled correctly.
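
The two-step method can be sketched on simulated data. Everything in the block below is an assumption made for illustration: the parameter values, the AR(1) form of the deviations from trend, and the use of a simple OLS regression without constant in the second step.

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha, beta, phi = 400, 1.0, 0.05, 0.6

# Simulate y_t = alpha + beta * t + u_t, where u_t is a stationary AR(1).
eps = rng.standard_normal(n)
u = np.zeros(n)
for t in range(1, n):
    u[t] = phi * u[t - 1] + eps[t]
t_index = np.arange(1, n + 1)
y = alpha + beta * t_index + u

# Step 1: OLS of y_t on a constant and t, neglecting the AR structure.
X = np.column_stack([np.ones(n), t_index])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coef

# Step 2: model the detrended residuals, here by an AR(1) fitted by OLS.
phi_hat = (resid[:-1] @ resid[1:]) / (resid[:-1] @ resid[:-1])

print("trend slope estimate:", round(coef[1], 4))
print("AR(1) coefficient estimate:", round(phi_hat, 3))
```

In line with the fast convergence rate mentioned above, the slope estimate is typically very close to the value used in the simulation, while the AR coefficient is estimated with the usual $\sqrt{n}$ precision.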

The  random  walk  model 

The above trend models are called deterministic, as they impose a deterministic
pattern on the time evolution of the mean of the time series. That is,
every time step the mean of the series increases by the same amount $\beta$.
Another type of trend model contains so-called stochastic trends. The
simplest model is the random walk

$$y_t = y_{t-1} + \varepsilon_t. \qquad (7.20)$$

Here $\varepsilon_t$ is white noise. The name ‘random walk’ originates from the fact that
the trend direction cannot be predicted, because for a given value of $y_{t-1}$ it is
equally likely that $y_t > y_{t-1}$ as that $y_t < y_{t-1}$. Indeed, the expected change
$\Delta y_t = y_t - y_{t-1} = \varepsilon_t$ in the time series, conditional on the past information
$Y_{t-1} = \{y_{t-1}, y_{t-2}, \ldots\}$, is equal to

$$E[\Delta y_t | Y_{t-1}] = E[\varepsilon_t] = 0.$$

So we cannot predict whether the time series will move upward
or downward. This differs from stationary series. For instance, for a stationary
AR(1) model $y_t = \phi y_{t-1} + \varepsilon_t$ with $-1 < \phi < 1$ we get
$\Delta y_t = (\phi - 1) y_{t-1} + \varepsilon_t$, and as $(\phi - 1) < 0$ it follows that

$$E[\Delta y_t | Y_{t-1}] < 0 \text{ if } y_{t-1} > 0, \qquad E[\Delta y_t | Y_{t-1}] > 0 \text{ if } y_{t-1} < 0.$$

As $E[y_t] = 0$, this means that a stationary series has the tendency to return to
the mean value of the process. A stationary process is, therefore, said to be
mean reverting. The random walk does not have this property.
Stochastic properties of the random walk

A random walk series does not have a steady trend direction. However,
during some time intervals a sequence of particular values of $\varepsilon_t$ may lead to
local trendlike movements in the series $y_t$. This can be explained by recursive
substitution in (7.20), which gives

$$y_t = y_1 + \sum_{s=2}^{t} \varepsilon_s.$$

This shows that a large value of $\varepsilon_s$ affects the values of the time series $y_t$ for all
$t \geq s$. The impact of a shock $\varepsilon_s$ on $y_t$ does not diminish over time. One
therefore says that the shocks in this model are persistent, in contrast with
stationary processes, where the effect of innovations eventually dies out. The
random walk process is non-stationary. For instance, if $y_1 = 0$ is given, then
the mean of the series is $E[y_t] = 0$, but the variance is equal to
$\mathrm{var}(y_t) = (t - 1)\sigma^2$, which increases over time. For large values of t, the
correlation between $y_t$ and $y_{t-k}$ is approximately equal to $\sqrt{(t-k)/t}$. So the SACF
will have values that are very close to 1 and that die out only very slowly (see
Exercise 7.13).
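
These properties can be illustrated by a small Monte Carlo experiment. The sketch below simulates many independent random walks with $y_1 = 0$ and standard normal shocks (the number of paths, the horizon, and the seed are arbitrary choices) and compares the cross-path variance at time t with $(t-1)\sigma^2$, and the correlation between $y_t$ and $y_{t-k}$ with the large-t approximation $\sqrt{(t-k)/t}$.

```python
import numpy as np

rng = np.random.default_rng(1)
paths, T, sigma = 20000, 50, 1.0

# Shocks eps_2, ..., eps_T for every path; y_1 = 0 and y_t = y_1 + sum of shocks.
eps = sigma * rng.standard_normal((paths, T - 1))
y = np.concatenate([np.zeros((paths, 1)), np.cumsum(eps, axis=1)], axis=1)

# Variance across paths at each time t = 1, ..., T (column t-1 holds y_t).
var_t = y.var(axis=0)
print("var(y_10):", round(var_t[9], 2), "theory:", 9 * sigma**2)
print("var(y_50):", round(var_t[49], 2), "theory:", 49 * sigma**2)

# Correlation between y_t and y_{t-k} for t = 50, k = 10.
t, k = 50, 10
corr = np.corrcoef(y[:, t - 1], y[:, t - k - 1])[0, 1]
print("corr:", round(corr, 3), "approximation:", round(np.sqrt((t - k) / t), 3))
```

The simulated variance grows linearly in t, and the correlation at a lag of ten periods is still close to 0.9, which is why the SACF of a random walk dies out so slowly.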

Integrated  processes  and  ARIMA  models 

The AR polynomial of the model (7.20) is given by $\phi(z) = 1 - z$, which has
a root $\phi(z) = 0$ at $z = 1$. For this reason the random walk model (7.20) is
said to have a unit root. This shows once more that the random walk is
not stationary, as in Section 7.1.3 we derived the condition that all the roots
of the AR polynomial should be outside the unit circle for a stationary
process. The process $y_t$ is called integrated of order one, as $y_t$ is non-stationary
but the series of first differences $\Delta y_t = (1 - L) y_t = y_t - y_{t-1} = \varepsilon_t$
is stationary.

The random walk model can be extended by including a constant term and
by incorporating (stationary) AR and (invertible) MA terms. An
ARIMA(p, d, q) model has the property that $\Delta^k y_t$ is non-stationary for all
$k < d$ and that $\Delta^d y_t$ is stationary and follows an ARMA(p, q) model. Such
models are described by

$$\phi(L)(1 - L)^d y_t = \alpha + \theta(L) \varepsilon_t,$$

where $\phi(z)$ and $\theta(z)$ are polynomials of degrees p and q respectively that have
all their roots outside the unit circle. Such a process is called integrated of
order d, and the process is said to have d unit roots. Because series that are
integrated of order $d = 1$ have the property that the difference $y_t - y_{t-1}$ is
stationary, such series $y_t$ are called difference stationary. For time series in
business and economics, the cases $d = 0$ and $d = 1$ are of most importance.
Stationary series have $d = 0$, and non-stationary series often become stationary
after taking first differences. In some applications one may encounter
series that are integrated of order $d = 2$; for instance, some nominal price
series may have this property. Higher orders of integration are very rare in
practice.
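
The operator $(1 - L)^d$ amounts to taking first differences d times. The following sketch (the function name `difference` is hypothetical) applies it to a purely deterministic quadratic sequence, chosen only because the outcome is easy to check by hand; it illustrates the mechanics of differencing, not the stochastic notion of integration itself.

```python
def difference(series, d=1):
    """Apply the first-difference operator (1 - L) to the series d times."""
    out = list(series)
    for _ in range(d):
        out = [b - a for a, b in zip(out, out[1:])]
    return out

levels = [t * t for t in range(1, 8)]   # 1, 4, 9, 16, 25, 36, 49
print(difference(levels, d=1))          # [3, 5, 7, 9, 11, 13]
print(difference(levels, d=2))          # [2, 2, 2, 2, 2]
```

Each round of differencing shortens the series by one observation, so an ARMA(p, q) model for $(1 - L)^d y_t$ is estimated on $n - d$ observations.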

Estimation  and  diagnostic  tests  of  models  with  trends 

An ARIMA(p, d, q) model can simply be estimated and evaluated by
applying the results of Section 7.2 on the suitably differenced stationary
series $(1 - L)^d y_t$. The question whether the trend is deterministic or stochastic
and, in the latter case, the question what is the order of integration is
treated in Section 7.3.3. The conventional statistical tests (such as t-tests and
LR-tests) and the diagnostic tests of Section 7.2.4 remain valid after the trend
has been appropriately removed (by regression in the case of a deterministic
trend or by differencing the data in the case of a stochastic trend). However,
if the model is misspecified because a deterministic trend is wrongly excluded
or because the data are not properly differenced, then the results
of Section 7.2 do not apply anymore. That is, the conventional tests no
longer follow the standard distributions of the stationary case. This affects
all standard inference procedures. It is, therefore, of major importance to
model the trend appropriately before any further analysis of the data is
performed.

Random  walk  with  drift 

If a constant term is added in the random walk model (7.20), then we get

$$\Delta y_t = \alpha + \varepsilon_t.$$

This is called a random walk with drift, and $\alpha$ is the drift term. By recursive
substitution this can be written as

$$y_t = y_1 + \alpha(t - 1) + \sum_{s=2}^{t} \varepsilon_s.$$

So the constant term $\alpha$ becomes the coefficient of a deterministic trend
component in the time series. This shows that the role of the constant term
in a model with a unit root is different from the one in stationary models.
The constant term has a similar trend interpretation in more general
ARIMA(p, 1, q) models that are integrated of order one. The key difference
from the deterministic trend model (7.19) is the stochastic trend term
$\sum_{s=2}^{t} \varepsilon_s$.

We summarize some of the properties of the random walk with
drift by comparing this model with the general AR(1) model
$y_t = \alpha + \phi y_{t-1} + \varepsilon_t$. If $-1 < \phi < 1$, then the series $y_t$ is stationary with
mean $\mu = \alpha/(1 - \phi)$. So in this case $\alpha = (1 - \phi)\mu$, and we can rewrite the
model as $y_t - \mu = \phi(y_{t-1} - \mu) + \varepsilon_t$ and also as
$\Delta y_t = (\phi - 1)(y_{t-1} - \mu) + \varepsilon_t$. As $\phi - 1 < 0$ for the stationary series it follows that

$$E[\Delta y_t | Y_{t-1}] < 0 \text{ if } y_{t-1} > \mu, \qquad E[\Delta y_t | Y_{t-1}] > 0 \text{ if } y_{t-1} < \mu.$$

This shows again that a stationary series is mean reverting. On the other
hand, in the random walk with drift model we get

$$E[\Delta y_t | Y_{t-1}] = E[\alpha + \varepsilon_t] = \alpha.$$

So in this case we always expect the series to move in the same direction (upward if $\alpha > 0$
and downward if $\alpha < 0$). This shows that the process is not stationary.
Further, by writing the random walk with drift as
$y_t = y_1 + \alpha(t - 1) + \sum_{s=2}^{t} \varepsilon_s$, it follows that
$E[y_t | y_1] = y_1 + (t - 1)\alpha$ and $\mathrm{var}(y_t | y_1) = (t - 1)\sigma^2$,
so that both the mean and the variance are unbounded.
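
The recursive substitution for the random walk with drift can be verified numerically. The sketch below uses arbitrary illustrative values for the drift, the starting value, and the seed, generates the series by the recursion $y_t = y_{t-1} + \alpha + \varepsilon_t$, and compares it with the closed form $y_1 + \alpha(t-1) + \sum_{s=2}^{t} \varepsilon_s$.

```python
import random

random.seed(0)
alpha, y1, n = 0.5, 10.0, 20

# Shocks eps_2, ..., eps_n (list index 0 corresponds to eps_2).
eps = [random.gauss(0.0, 1.0) for _ in range(n - 1)]

# Recursion: y_t = y_{t-1} + alpha + eps_t.
y = [y1]
for e in eps:
    y.append(y[-1] + alpha + e)

# Closed form: deterministic trend alpha*(t-1) plus the stochastic trend.
closed = [y1 + alpha * (t - 1) + sum(eps[:t - 1]) for t in range(1, n + 1)]

max_gap = max(abs(a - b) for a, b in zip(y, closed))
trend_only = [y1 + alpha * (t - 1) for t in range(1, n + 1)]
print("largest difference between recursion and closed form:", max_gap)
print("last value:", round(y[-1], 2), "deterministic part:", trend_only[-1])
```

The gap between the last value and its deterministic part is the accumulated sum of shocks, that is, the stochastic trend.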

Trend  stationary  processes 

Next we consider the model

$$y_t = \alpha + \beta t + \phi y_{t-1} + \varepsilon_t. \qquad (7.21)$$

This model contains as special cases the stochastic trend model (7.20) (for
$\alpha = \beta = 0$ and $\phi = 1$) and the deterministic trend model (7.19) (if
$-1 < \phi < 1$). We consider the model with $-1 < \phi < 1$ in more detail.
In this case the AR polynomial $\phi(z) = 1 - \phi z$ is stationary. Define
$z_t = y_t - \delta_1 - \delta_2 t$, where $\delta_2 = \beta/(1 - \phi)$ and $\delta_1 = (\alpha - \phi\delta_2)/(1 - \phi)$; then
it follows by direct substitution in equation (7.21) that $z_t = \phi z_{t-1} + \varepsilon_t$
and hence

$$(y_t - \delta_1 - \delta_2 t) = \phi(y_{t-1} - \delta_1 - \delta_2(t - 1)) + \varepsilon_t.$$

Because $-1 < \phi < 1$, the process $z_t$ is stationary, so that the effect of the
innovations $\varepsilon_t$ eventually dies out. So in this case $f(t) = \delta_1 + \delta_2 t$ is the long-term
trend in the series and deviations from this trend are transient. The
series returns to the trend $f(t)$ in the long run. Therefore the series $y_t$ is called
trend stationary in this case. More generally, every process (7.19) with
stationary AR polynomial $\phi(z)$ is trend stationary. On the other hand, if
$\phi = 1$ and $\beta \neq 0$ in (7.21), then we can rewrite this as $\Delta y_t = \alpha + \beta t + \varepsilon_t$,
so that $E[\Delta y_t | Y_{t-1}] = \alpha + \beta t$. In this case the series is expected to exhibit
changes that become larger as time progresses. This corresponds to quadratic
trend behaviour, and the model contains both a stochastic trend (as $\phi = 1$)
and a deterministic trend (as $\beta \neq 0$).


A  model  with  latent  trend 

As an alternative for deterministic and stochastic trend models, the trend can also
be modelled by a so-called latent trend variable. For example, let

$$y_t = \mu_t + \varepsilon_t, \qquad \mu_{t+1} = \alpha + \mu_t + \eta_t, \tag{7.22}$$

where $\varepsilon_t \sim N(0, \sigma_\varepsilon^2)$ and $\eta_t \sim N(0, \sigma_\eta^2)$ are independent white noise processes. The
trend component $\mu_t$ is unobserved and follows a random walk with drift. If
$\sigma_\eta^2 = 0$, this model reduces to a deterministic trend, as $\mu_t = \mu_1 + \alpha(t - 1)$ in this
case. If $\sigma_\varepsilon^2 = 0$, then the model reduces to a random walk with drift
$y_t = \alpha + y_{t-1} + \eta_{t-1}$. If $\sigma_\eta^2 > 0$, then the trend can be eliminated by taking first
differences. This gives


$$\Delta y_t = \alpha + \varepsilon_t - \varepsilon_{t-1} + \eta_{t-1}.$$

As the correlations of the composite error term on the right-hand side are zero for
all lags $k > 1$, it follows that $\Delta y_t$ is an MA(1) process. So $y_t$ is an ARIMA(0,1,1)
process and can be written as

$$\Delta y_t = \alpha + \omega_t - \theta\omega_{t-1}. \tag{7.23}$$

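The parameter $\theta$ can be derived from $\sigma_\varepsilon^2$ and $\sigma_\eta^2$, as follows. As $\varepsilon_t$ and $\eta_t$ are
independent white noise processes, it follows that $\Delta y_t$ has variance $\gamma_0 = 2\sigma_\varepsilon^2 + \sigma_\eta^2$
and first order covariance $\gamma_1 = -\sigma_\varepsilon^2$. On the other hand, from the ARIMA(0,1,1)
model it follows that $\gamma_0 = (1 + \theta^2)\sigma_\omega^2$ and $\gamma_1 = -\theta\sigma_\omega^2$. Let $\lambda = \sigma_\eta^2/\sigma_\varepsilon^2$ be the so-called
signal-to-noise ratio of the model; then the first order autocorrelation of $\Delta y_t$
is given by

$$\rho_1 = \frac{E[(\Delta y_t - \alpha)(\Delta y_{t-1} - \alpha)]}{E[(\Delta y_t - \alpha)^2]} = \frac{\gamma_1}{\gamma_0} = \frac{-\theta}{1 + \theta^2} = \frac{-\sigma_\varepsilon^2}{2\sigma_\varepsilon^2 + \sigma_\eta^2} = \frac{-1}{2 + \lambda}.$$

Because $\lambda > 0$, it follows that $\theta > 0$, and the (invertible) solution (with $\theta < 1$) for $\theta$
in terms of $\lambda$ is given by $\theta = 1 + \tfrac{1}{2}\lambda - \tfrac{1}{2}\sqrt{\lambda^2 + 4\lambda}$. In the next section we consider
the use of this trend model in trend estimation and forecasting.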
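
The mapping from the signal-to-noise ratio to the MA(1) parameter can be checked numerically. The following Python sketch is only an illustration; the function names `theta_from_snr` and `rho1_from_theta` are hypothetical and not taken from the text. It evaluates the invertible root and confirms that it reproduces the first order autocorrelation $-1/(2 + \lambda)$.

```python
import math


def theta_from_snr(lam: float) -> float:
    """Invertible MA(1) parameter for lam = sigma_eta^2 / sigma_eps^2 (lam >= 0)."""
    if lam < 0:
        raise ValueError("the signal-to-noise ratio must be non-negative")
    # Smaller root of theta^2 - (2 + lam) * theta + 1 = 0.
    return 1.0 + 0.5 * lam - 0.5 * math.sqrt(lam * lam + 4.0 * lam)


def rho1_from_theta(theta: float) -> float:
    """First order autocorrelation of the MA(1) process omega_t - theta * omega_{t-1}."""
    return -theta / (1.0 + theta * theta)


for lam in (0.0, 0.1, 1.0, 10.0):
    theta = theta_from_snr(lam)
    # The last two columns should coincide.
    print(lam, round(theta, 4), round(rho1_from_theta(theta), 4), round(-1.0 / (2.0 + lam), 4))
```

A small ratio gives a value of $\theta$ close to one (a smooth trend), and a large ratio gives a value close to zero.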


Example  7.12:  Simulated  Series  with  Trends 

We give a graphical illustration of the differences between trend stationary
and difference stationary processes by simulating time series from the model
(7.21), for different values of the parameters $(\alpha, \beta, \phi)$. Exhibit 7.13 shows
graphs of the following five simulated time series.


Exhibit 7.13 Simulated Series with Trends (Example 7.12)

Simulated series of length 500 ((f)-(j)) and first fifty observations ((a)-(e)) generated by the
model $y_t = \alpha + \beta t + \phi y_{t-1} + \varepsilon_t$, where $\varepsilon_t$ are NID(0,1), for different values of the parameters
$(\alpha, \beta, \phi)$.

•  The first series has parameters $(\alpha, \beta, \phi) = (1, 0, 0.8)$. This is a stationary
series without trend.

•  The second series has parameters $(\alpha, \beta, \phi) = (1, 0.01, 0.8)$. This series is
trend stationary. The series shows some short-term fluctuations but it
always returns to the long-term trend $1 + 0.01t$.

•  The third series is the random walk with parameters $(\alpha, \beta, \phi) = (0, 0, 1)$.
This series has prolonged periods of up- and downward movements, but
there is no clear overall trend direction.

•  The fourth series is a random walk with drift with parameters
$(\alpha, \beta, \phi) = (0.1, 0, 1)$. This series shows an upward trend but the location
of the trend is not stable.

•  The fifth series has parameters $(\alpha, \beta, \phi) = (0, 0.001, 1)$. This series contains
a stochastic trend (as $\phi = 1$) and a deterministic trend (as $\beta = 0.001$).
The combination of these two trends results in quadratic trend behaviour,
since the expected growth $E[\Delta y_t \mid Y_{t-1}] = 0.001t$ increases over time.

Exhibit  7.13  (f)-(j)  show  the  series  over  a  sample  period  of  length  n  =  500 
where  the  differences  are  quite  pronounced;  (a)-(e)  show  the  series  for  the 
first  fifty  observations,  and  the  differences  are  much  less  clear  in  this  case. 
Obviously,  differences  in  trends  can  be  distinguished  only  after  a  sufficiently 
long  observation  period. 
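
A simulation of this kind is easy to repeat. The Python sketch below generates the five parameter combinations of the example from model (7.21); it is an illustration rather than the code used for Exhibit 7.13. The starting value `y0 = 0`, the random seed, and the name `simulate_trend_model` are assumptions made here, so the paths will differ from those in the exhibit.

```python
import numpy as np


def simulate_trend_model(alpha, beta, phi, n=500, y0=0.0, seed=0):
    """Simulate y_t = alpha + beta*t + phi*y_{t-1} + eps_t, t = 1..n, eps_t ~ NID(0,1)."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n)
    y = np.empty(n)
    previous = y0  # assumed starting value, not specified in the text
    for t in range(1, n + 1):
        previous = alpha + beta * t + phi * previous + eps[t - 1]
        y[t - 1] = previous
    return y


PARAMS = {
    "stationary": (1.0, 0.0, 0.8),
    "trend stationary": (1.0, 0.01, 0.8),
    "random walk": (0.0, 0.0, 1.0),
    "random walk with drift": (0.1, 0.0, 1.0),
    "stochastic + deterministic": (0.0, 0.001, 1.0),
}

# The same seed is used for every series, so all five share one set of innovations.
series = {name: simulate_trend_model(*p) for name, p in PARAMS.items()}
for name, y in series.items():
    print(f"{name:28s} y_50 = {y[49]:8.2f}   y_500 = {y[-1]:8.2f}")
```

Comparing the values at $t = 50$ and $t = 500$ gives a numerical impression of how slowly the differences between the trend types become visible.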

Exercises:  S:  7.13a,  b. 


7.3.2  Trend  estimation  and  forecasting 

Forecasting  a  deterministic  trend 

If a time series shows trending behaviour, this is of major importance in
forecasting. We consider the estimation and forecasting of the trend models of
the foregoing section, that is, trends that are deterministic, stochastic, or
latent. First we consider the linear deterministic trend model (7.18), that is,

$$y_t = \alpha + \beta t + \varepsilon_t, \qquad t = 1, \ldots, n.$$

We suppose that the time series $y_t$ is observed at times $t = 1, \ldots, n$, and we
wish to forecast $y_t$ at time $t = n + h$. Let $a$ and $b$ denote the OLS estimates of $\alpha$
and $\beta$, based on the data $(y_1, \ldots, y_n)$. The $h$-step-ahead forecast is given by

$$\hat{y}_{n+h} = a + b(n + h).$$
If we neglect the errors in the parameter estimates (that is, if we assume that
$a = \alpha$ and $b = \beta$), then the forecast error variance is $\sigma^2$ for all forecast
horizons. If the trend is deterministic, then the uncertainty about the future
does not depend on the forecast horizon. If the parameter uncertainty is
taken into account, then the results in Section 2.4.1 (p. 105) show that the
forecast variance is equal to

$$E[(y_{n+h} - \hat{y}_{n+h})^2] = \sigma^2\left(1 + \frac{1}{n} + \frac{\left(n + h - \tfrac{1}{2}(n+1)\right)^2}{\sum_{t=1}^{n}\left(t - \tfrac{1}{2}(n+1)\right)^2}\right) \approx \sigma^2.$$

Here the last approximation is valid if the number of observations $n$ is large
compared to the forecast horizon $h$. This motivates the usual practice of
neglecting the parameter uncertainty in constructing prediction intervals.
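
To see how quickly the effect of parameter uncertainty vanishes, the variance inflation factor of a linear trend regression can be evaluated directly. The Python sketch below is an illustration; the function names are hypothetical, and the formula used is the standard simple-regression prediction variance with regressor $t$. It computes the factor multiplying $\sigma^2$ in two ways.

```python
import numpy as np


def trend_forecast_variance_factor(n: int, h: int) -> float:
    """Factor multiplying sigma^2: 1 + 1/n + (n + h - tbar)^2 / sum((t - tbar)^2)."""
    t = np.arange(1, n + 1, dtype=float)
    tbar = t.mean()  # equals (n + 1) / 2
    return float(1.0 + 1.0 / n + (n + h - tbar) ** 2 / np.sum((t - tbar) ** 2))


def factor_via_matrix(n: int, h: int) -> float:
    """Same factor from the general formula 1 + x0' (X'X)^{-1} x0."""
    X = np.column_stack([np.ones(n), np.arange(1, n + 1, dtype=float)])
    x0 = np.array([1.0, float(n + h)])
    return float(1.0 + x0 @ np.linalg.solve(X.T @ X, x0))


for n in (20, 136, 1000):
    print(n, round(trend_forecast_variance_factor(n, 1), 4),
          round(trend_forecast_variance_factor(n, 20), 4))
```

For $n$ large relative to $h$ the factor is close to one, which is the approximation used in the text.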


Forecasting  a  stochastic  trend 

Next  we  consider  the  random  walk  with  drift 

$$y_t = \alpha + y_{t-1} + \varepsilon_t, \qquad t = 2, \ldots, n.$$

Let $a$ be the OLS estimate of $\alpha$. This estimate is obtained by regression in the
model $\Delta y_t = \alpha + \varepsilon_t$, so that $a$ is the sample average of $\Delta y_t$ with variance
$\sigma^2/(n - 1)$. The $h$-step-ahead forecast is given by

$$\hat{y}_{n+h} = y_n + ah.$$

If $a = \alpha$, then the forecast error is equal to $y_{n+h} - \hat{y}_{n+h} = \sum_{j=1}^{h} \varepsilon_{n+j}$ with
forecast variance $h\sigma^2$. Therefore, in contrast with a deterministic trend, for
series with a stochastic trend the forecast uncertainty grows for larger
forecast horizons. If the parameter uncertainty is taken into account, then
the forecast error is given by $y_{n+h} - \hat{y}_{n+h} = h(\alpha - a) + \sum_{j=1}^{h} \varepsilon_{n+j}$. As the
estimate of $\alpha$ is based on the observations $y_t$ with $t \le n$, all terms in this
error are uncorrelated. It follows that

$$E[(y_{n+h} - \hat{y}_{n+h})^2] = \sigma^2\left(h + \frac{h^2}{n - 1}\right) \approx h\sigma^2,$$


where  the  last  approximation  is  valid  if  n  is  large  compared  to  h.  This  again 
motivates  the  usual  practice  of  neglecting  the  parameter  uncertainty  in 
constructing  prediction  intervals  for  the  series  yt. 
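
The variance formula for the random walk with drift can be checked by a small Monte Carlo experiment. The following Python sketch is illustrative only; the function names, the seed, and the chosen values of $n$, $h$, and the number of replications are assumptions made here.

```python
import numpy as np


def rw_drift_forecast_variance(sigma2: float, h: int, n: int) -> float:
    """Theoretical forecast variance sigma^2 * (h + h^2 / (n - 1))."""
    return sigma2 * (h + h ** 2 / (n - 1))


def simulate_forecast_errors(alpha, sigma, n, h, reps, seed=0):
    """Forecast errors y_{n+h} - (y_n + a*h), where a is the mean of the n-1 observed changes."""
    rng = np.random.default_rng(seed)
    dy = alpha + sigma * rng.standard_normal((reps, n - 1 + h))  # changes for t = 2..n+h
    a_hat = dy[:, : n - 1].mean(axis=1)           # OLS drift estimate from the sample
    future_change = dy[:, n - 1:].sum(axis=1)     # y_{n+h} - y_n
    return future_change - a_hat * h


errors = simulate_forecast_errors(alpha=0.1, sigma=1.0, n=50, h=10, reps=20000)
print("simulated variance  :", round(float(errors.var()), 3))
print("theoretical variance:", round(rw_drift_forecast_variance(1.0, 10, 50), 3))
print("ignoring uncertainty:", 10.0)
```

With $n = 50$ and $h = 10$ the parameter uncertainty adds roughly 20 per cent to the variance $h\sigma^2$; the contribution shrinks as $n$ grows relative to $h$.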
Forecasting of ARMA models with deterministic trends

The foregoing trend models are concerned only with the long-run trend and
neglect possible short-run fluctuations in the series. Now we consider forecasts
for the ARMA model with deterministic trend

$$\phi(L)y_t = \alpha + \beta t + \theta(L)\varepsilon_t.$$

Here $\phi(z)$ satisfies the stationarity condition and $\theta(z)$ the invertibility condition.
As in our analysis of (7.21) in Section 7.3.1, we can write the model in
the form

$$\phi(L)z_t = \theta(L)\varepsilon_t, \qquad z_t = y_t - \delta_1 - \delta_2 t.$$

Here the parameters $\delta_1$ and $\delta_2$ are chosen in such a way that $\phi(L)(\delta_1 + \delta_2 t) = \alpha + \beta t$.
The parameters $(\delta_1, \delta_2)$ are obtained from $(\alpha, \beta)$ by solving the
equation $\phi(L)(\delta_1 + \delta_2 t) = \phi(1)\delta_1 + \left(t - \sum_{k=1}^{p} \phi_k (t - k)\right)\delta_2 = \alpha + \beta t$. To obtain
forecasts of $y_t$, we can first forecast the stationary series $z_t$ by the methods
discussed in Section 7.1.6. Then forecasts for $y_t$ are computed by

$$\hat{y}_{n+h} = \hat{z}_{n+h} + \delta_1 + \delta_2 (n + h).$$

If the parameter uncertainty in $\delta_1$ and $\delta_2$ is neglected, then the forecast error
variance of $y_t$ is the same as that of $z_t$, because $y_{t+h} - \hat{y}_{t+h} = z_{t+h} - \hat{z}_{t+h}$ in
this case.
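
Matching the constant and the coefficient of $t$ in $\phi(L)(\delta_1 + \delta_2 t) = \alpha + \beta t$ gives $\delta_2 = \beta/\phi(1)$ and $\delta_1 = (\alpha - \delta_2 \sum_k k\phi_k)/\phi(1)$. The Python sketch below applies this; the function names are hypothetical, and the AR polynomial is assumed to be written as $\phi(z) = 1 - \phi_1 z - \cdots - \phi_p z^p$.

```python
def trend_parameters(alpha, beta, phi):
    """Solve phi(L)(delta1 + delta2*t) = alpha + beta*t for (delta1, delta2).

    phi is the list [phi_1, ..., phi_p] of phi(z) = 1 - phi_1*z - ... - phi_p*z^p.
    """
    phi_at_one = 1.0 - sum(phi)
    if abs(phi_at_one) < 1e-12:
        raise ValueError("phi(1) = 0: the AR polynomial has a unit root")
    weighted = sum(k * coef for k, coef in enumerate(phi, start=1))  # sum of k * phi_k
    delta2 = beta / phi_at_one
    delta1 = (alpha - delta2 * weighted) / phi_at_one
    return delta1, delta2


def apply_ar_filter(phi, delta1, delta2, t):
    """Evaluate phi(L)(delta1 + delta2*t) at time t, as a check on the solution."""
    trend = lambda s: delta1 + delta2 * s
    return trend(t) - sum(coef * trend(t - k) for k, coef in enumerate(phi, start=1))


d1, d2 = trend_parameters(alpha=1.0, beta=0.02, phi=[0.5, 0.2])
for t in (3, 10, 100):
    # The two printed values should agree for every t.
    print(t, round(apply_ar_filter([0.5, 0.2], d1, d2, t), 6), round(1.0 + 0.02 * t, 6))
```

For an AR(1) polynomial this reduces to $\delta_2 = \beta/(1 - \phi)$ and $\delta_1 = (\alpha - \phi\delta_2)/(1 - \phi)$, the expressions used for model (7.21).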

Forecasting  of  ARIMA  models 

Next we consider the forecasting of time series that are integrated of order
1 and that are described by the ARIMA($p$, 1, $q$) model

$$\phi(L)(1 - L)y_t = \alpha + \theta(L)\varepsilon_t.$$

The methods of Section 7.1.6 can be used to forecast the stationary variable
$z_t = (1 - L)y_t$. As $y_{n+h} = y_n + \sum_{j=1}^{h} z_{n+j}$, it follows that the $h$-step-ahead
forecast of $y_t$ is given by

$$\hat{y}_{n+h} = y_n + \sum_{j=1}^{h} \hat{z}_{n+j}.$$

The forecast error is $\sum_{j=1}^{h} (z_{n+j} - \hat{z}_{n+j})$. Using the notation of Section 7.1.6,
the forecast variance is given by $\mathrm{SPE}(h) = \sigma^2 \sum_{j=1}^{h} \left(\sum_{k=0}^{j-1} \psi_k\right)^2$ (see Exercise
7.5).
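
The forecast variance of an integrated series accumulates the MA($\infty$) weights $\psi_k$ of the differenced series. The Python sketch below is an illustration under a stated sign convention (an assumption made here): the AR polynomial is $1 - \sum_k \texttt{ar}[k-1]z^k$ and the MA polynomial is $1 + \sum_k \texttt{ma}[k-1]z^k$, so the MA(1) model (7.23) corresponds to `ma = [-theta]`. The function names are hypothetical.

```python
import numpy as np


def psi_weights(ar, ma, m):
    """First m weights psi_0..psi_{m-1} of the MA(infinity) form of an ARMA model."""
    psi = np.zeros(m)
    psi[0] = 1.0
    for j in range(1, m):
        value = ma[j - 1] if j <= len(ma) else 0.0
        for k in range(1, min(j, len(ar)) + 1):
            value += ar[k - 1] * psi[j - k]
        psi[j] = value
    return psi


def arima_forecast_variance(ar, ma, sigma2, h):
    """SPE(h) = sigma^2 * sum_{j=1}^{h} (psi_0 + ... + psi_{j-1})^2 for an ARIMA(p,1,q) series."""
    cumulative = np.cumsum(psi_weights(ar, ma, h))  # cumulative[j-1] = psi_0 + ... + psi_{j-1}
    return float(sigma2 * np.sum(cumulative ** 2))


# Random walk: psi_0 = 1 and all other weights zero, so SPE(h) = h * sigma^2.
print(arima_forecast_variance([], [], sigma2=2.0, h=5))
# ARIMA(0,1,1) with theta = 0.6: SPE(h) = sigma^2 * (1 + (h - 1) * (1 - theta)^2).
print(arima_forecast_variance([], [-0.6], sigma2=1.0, h=4))
```

The random walk case reproduces the forecast variance $h\sigma^2$ derived earlier for the stochastic trend model with known drift.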
Trend forecasts by exponential smoothing (EWMA)

Next we consider trend forecasting in the model (7.22) with latent trend
variable $\mu_t$. To derive the forecast formula we rewrite this model as the
ARIMA(0,1,1) model (7.23). For simplicity we assume for the moment
that the model contains no drift, so that $\alpha = 0$. The MA(1) model for
$z_t = \Delta y_t$ is then given by $z_t = \omega_t - \theta\omega_{t-1}$, where $0 < \theta < 1$. Because the MA
model for $z_t$ is invertible, the results in Section 7.1.6 show that $\omega_t = z_t - \hat{z}_t$,
where the one-step-ahead forecast $\hat{z}_t$ of $z_t$ is equal to $\hat{z}_t = -\theta\omega_{t-1}$. Since the
value of $y_{t-1}$ is known at time $(t - 1)$, it follows that $\hat{z}_t = \Delta\hat{y}_t = \hat{y}_t - y_{t-1}$,
where $\hat{y}_t$ is the forecast of $y_t$ based on the observations $(y_{t-k}, k \ge 1)$. The
foregoing results for $z_t$ and $\hat{z}_t$ imply that $\omega_t = z_t - \hat{z}_t = y_t - \hat{y}_t$ and that
$\hat{y}_t = y_{t-1} + \hat{z}_t = y_{t-1} - \theta\omega_{t-1}$. So the one-step-ahead forecasts of $y_t$ are related
by

$$\hat{y}_{t+1} = y_t - \theta\omega_t = y_t - \theta(y_t - \hat{y}_t) = (1 - \theta)y_t + \theta\hat{y}_t.$$

Here $1 - \theta$ is called the smoothing factor. If this factor is small (that is,
if $\theta$ is close to one), then the old forecast $\hat{y}_t$ has relatively more weight
than the most recent observation $y_t$. If the smoothing factor is large (so
that $\theta$ is close to zero), then the new observation $y_t$ has a relatively large
weight. By repetitive substitution, the above forecast equation can be
written as

$$\hat{y}_{t+1} = (1 - \theta)\sum_{j=0}^{\infty} \theta^j y_{t-j}.$$

In the trend model (7.22) the term $\varepsilon_t$ is white noise, so that $\hat{\mu}_t = E[\mu_t \mid Y_{t-1}] = \hat{y}_t$
and hence

$$\hat{\mu}_{t+1} = (1 - \theta)y_t + \theta\hat{\mu}_t = (1 - \theta)\sum_{j=0}^{\infty} \theta^j y_{t-j}.$$

So in this latent trend model the trend is forecasted as a weighted average of
the past observations, with exponentially declining weights that sum up to
unity. This is called the exponentially weighted moving average (EWMA)
method of trend estimation. One also says that the trend is estimated by
exponential smoothing. The above forecast formula can also be used for
$h$-step-ahead forecasts, as $\hat{y}_{n+h} = \hat{\mu}_{n+h} = \hat{\mu}_{n+1}$. So the forecasts are the same
for all horizons. This shows that EWMA should be used only for series that
do not have a clear trend direction.
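
The EWMA recursion takes only a few lines of code. The Python sketch below is a minimal illustration; the function name `ewma_forecasts` and the start-up choice (the first forecast is set equal to the first observation unless `initial` is supplied) are assumptions, since the text does not fix an initialisation.

```python
import numpy as np


def ewma_forecasts(y, theta, initial=None):
    """One-step-ahead EWMA forecasts: f[t+1] = (1 - theta)*y[t] + theta*f[t].

    f[t] is the forecast of y[t] (0-based) from earlier observations, and f[-1]
    is the forecast of every future value after the last observation.
    """
    y = np.asarray(y, dtype=float)
    forecasts = np.empty(len(y) + 1)
    forecasts[0] = y[0] if initial is None else initial  # assumed start-up value
    for t in range(len(y)):
        forecasts[t + 1] = (1.0 - theta) * y[t] + theta * forecasts[t]
    return forecasts


y = np.array([1.0, 2.0, 4.0, 3.0])
print(ewma_forecasts(y, theta=0.5))   # smooth: old forecasts keep half of the weight
print(ewma_forecasts(y, theta=0.0))   # smoothing factor 1: forecast equals last observation
```

Because the same value is used for all horizons, plotting the out-of-sample forecasts gives a horizontal line starting at the last entry of the returned array.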
Choice of smoothing factor in EWMA

In practice the infinite summation of the EWMA is truncated (as the time
series is not observed for $t \le 0$). Further, the smoothing parameter $\theta$ should
be specified by the user. Smooth trend estimates are obtained by choosing
small values for the smoothing factor, that is, for $\theta \approx 1$. For $\theta \approx 0$ the trend
follows the fluctuations in the time series quite rapidly. One possible method
to choose $\theta$ is by minimizing the sum of squared one-step-ahead forecast
errors $\sum_{t=1}^{n} (y_t - \hat{y}_t)^2$. If the error terms $\omega_t$ are normally distributed, this
corresponds to the ML estimate of $\theta$ in the MA(1) model for $\Delta y_t$. As this
criterion focuses on one-step-ahead forecasts, the obtained trend is often only
suitable for short-run forecasts. If long-run forecasts are needed, then other
values of $\theta$ with smoother trends may provide better results.
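
Choosing $\theta$ by minimizing the sum of squared one-step-ahead forecast errors can be sketched with a simple grid search. The Python code below is an illustration, not the procedure of any particular package; the grid, the start-up forecast (equal to the first observation), the function names, and the simulated ARIMA(0,1,1) test series are all assumptions made here.

```python
import numpy as np


def ewma_sse(y, theta):
    """Sum of squared one-step-ahead EWMA forecast errors for a given theta."""
    y = np.asarray(y, dtype=float)
    forecast = y[0]  # assumed start-up: the forecast of y[1] is y[0]
    sse = 0.0
    for obs in y[1:]:
        sse += (obs - forecast) ** 2
        forecast = (1.0 - theta) * obs + theta * forecast
    return sse


def choose_theta(y, grid=None):
    """Return the grid value of theta with the smallest sum of squared errors."""
    grid = np.linspace(0.0, 0.99, 100) if grid is None else np.asarray(grid)
    sse = [ewma_sse(y, theta) for theta in grid]
    return float(grid[int(np.argmin(sse))])


# Simulated ARIMA(0,1,1) series without drift and with theta = 0.6.
rng = np.random.default_rng(1)
omega = rng.standard_normal(2001)
y_sim = np.cumsum(omega[1:] - 0.6 * omega[:-1])
theta_hat = choose_theta(y_sim)
print("selected theta:", theta_hat, " smoothing factor:", round(1.0 - theta_hat, 2))
```

For a long simulated series the selected value should lie near the true $\theta$; with short samples the criterion can be rather flat, which is one reason why a user-specified value is sometimes preferred.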

Holt-Winters  trend  forecasts 

In the above derivation of the EWMA we assumed that $\alpha = 0$ in (7.22), so
that the time series $y_t$ has no drift term in the trend. If the time series has a
clear overall trend direction, then this can be modelled by taking $\alpha \neq 0$ in
(7.22), but instead one often uses a more flexible model that allows for
variations in the trend component $\alpha$. This leads to the model

$$y_t = \mu_t + \varepsilon_t, \qquad \mu_{t+1} = \alpha_t + \mu_t + \eta_t, \qquad \alpha_{t+1} = \alpha_t + \zeta_t.$$

Here the three noise processes $(\varepsilon_t, \eta_t, \zeta_t)$ are assumed to be mutually independent
and normally distributed white noise processes. Least squares forecasts
and trend estimates can be determined in a way similar to the method discussed
before for EWMA. This gives the so-called Holt-Winters trend estimate

$$\hat{\mu}_{t+1} = (1 - \theta_1)y_t + \theta_1(\hat{\alpha}_t + \hat{\mu}_t), \qquad \hat{\alpha}_{t+1} = (1 - \theta_2)(\hat{\mu}_{t+1} - \hat{\mu}_t) + \theta_2\hat{\alpha}_t.$$

To apply this method one should specify two smoothing factors,
$0 < 1 - \theta_1 < 1$ for the 'level' $\mu_t$ and $0 < 1 - \theta_2 < 1$ for the 'slope' $\alpha_t$ in the
series. These parameters can be estimated, for instance, by ML in the corresponding
ARIMA(0,2,2) model for $y_t$ (see Exercise 7.5). Whereas the EWMA
forecasts are the same for all horizons, so that they lie on a horizontal line,
the Holt-Winters forecasts lie on a straight line (see Exercise 7.5).
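
The two Holt-Winters recursions can be run jointly, as in the following Python sketch. It implements only the level and slope updates stated above; the function name, the array layout, and the start-up values `level0` and `slope0` are assumptions made for illustration, and the forecast extrapolation itself is left to Exercise 7.5.

```python
import numpy as np


def holt_winters_trend(y, theta1, theta2, level0=None, slope0=0.0):
    """Level and slope estimates from the Holt-Winters recursions.

    level[t] = (1 - theta1)*y[t-1] + theta1*(slope[t-1] + level[t-1])
    slope[t] = (1 - theta2)*(level[t] - level[t-1]) + theta2*slope[t-1]
    for t = 1..n, where level[0] and slope[0] are the (assumed) start-up values.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    level = np.empty(n + 1)
    slope = np.empty(n + 1)
    level[0] = y[0] if level0 is None else level0
    slope[0] = slope0
    for t in range(1, n + 1):
        level[t] = (1.0 - theta1) * y[t - 1] + theta1 * (slope[t - 1] + level[t - 1])
        slope[t] = (1.0 - theta2) * (level[t] - level[t - 1]) + theta2 * slope[t - 1]
    return level, slope


# An exactly linear series y_t = 2 + 3t with matching start-up values is tracked perfectly.
t = np.arange(1, 11)
y_lin = 2.0 + 3.0 * t
level, slope = holt_winters_trend(y_lin, theta1=0.8, theta2=0.8, level0=2.0, slope0=3.0)
print(level[1:])   # equals the observations
print(slope)       # constant slope of 3
```

With both smoothing factors set to 0.2 (that is, `theta1 = theta2 = 0.8`), the recursion corresponds to the smoother of the two Holt-Winters settings discussed in Example 7.13.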

Example  7.13:  Industrial  Production  (continued) 

We continue our analysis of the data on industrial production in the USA. In
Section 7.2 we considered the quarterly series $\Delta_4 y_t$ of yearly growth rates.
The fourth difference removes the trend of the series, so that $\Delta_4 y_t$ could be
modelled as a stationary series by means of ARMA models (see for instance
Example 7.11). Instead of this differenced series, we will now consider the
series $y_t$ consisting of the logarithms of the index of US industrial production.
This time series was plotted in Exhibit 7.1 (b) in Example 7.1. We will discuss
several models for the trend of this series, namely (i) a deterministic trend
model, (ii) a stochastic trend model, and (iii) EWMA and Holt-Winters trend
estimates.

(i)  Deterministic  trend  model 

First  we  fit  a  deterministic  linear  trend.  Exhibit  7.14  (b)  and  (c)  show  that  the 
series  moves  around  this  trend  line  but  that  deviations  persist  for  relatively 
long  periods.  This  strong  positive  serial  correlation  is  also  indicated  by  the 
very  low  value  for  the  Durbin-Watson  statistic  (0.08)  of  the  regression  in 
Panel  1  of  Exhibit  7.14.  Exhibit  7.14  (b)  also  shows  the  (dynamic)  out-of- 
sample forecasts over the period 1995.1 to 1999.4. Comparing these forecasts
with the actual values over the period 1995.1 to 1998.3, we see that the
general  direction  is  predicted  quite  accurately  but  that  the  forecast  intervals 
are  quite  wide.  The  intervals  are  equally  wide  for  all  horizons.  The  RMSE 
over  the  period  1995.1  to  1998.3  is  0.0249. 

(ii)  Stochastic  trend  model 

Next  we  estimate  a  random  walk  with  drift.  Panel  4  of  Exhibit  7.14  shows 
the corresponding regression, with residuals in (f), and (e) shows the (dynamic)
out-of-sample forecasts. The point forecasts are comparable to those
of  the  deterministic  trend  model.  For  short  forecast  horizons  the  forecast 
intervals  are  narrower  than  for  longer  horizons.  The  RMSE  over  the  period 
1995.1  to  1998.3  is  0.0348.  So  the  model  with  stochastic  trend  performs 
worse  in  this  respect  as  compared  with  the  deterministic  trend  model. 

(iii)  EWMA  and  Holt-Winters  trend  estimates 

Exhibit 7.15 (a) and (b) show the trend estimates obtained by EWMA. If the
smoothing factor is estimated by ML, this gives a value of $1 - \theta = 0.97$ in
(a), that is, the past is forgotten very fast. If we set this smoothing factor at
$1 - \theta = 0.20$ in (b), then the estimated trend becomes much smoother and
trend deviations in the series are followed only after longer delays. Further
note that in the EWMA the estimated trend always lags behind the observed
series. This indicates once more that this method is not suitable for trending
series. Exhibits 7.15 (c) and (d) show the Holt-Winters trend estimates, with
ML smoothing factors ($1 - \theta_1 = 1.00$ and $1 - \theta_2 = 0.02$) in (c) and with
both smoothing factors equal to 0.20 in (d). The last trend estimate is
relatively smooth and it lags behind much less than the smooth EWMA
trend.
Panel 1: Dependent Variable: Y
Method: Least Squares
Sample: 1961:1 1994:4; Included observations: 136

Variable          Coefficient   Std. Error   t-Statistic   Prob.
C                 3.781132      0.012345     306.2982      0.0000
@TREND(61.1)      0.007140      0.000158     45.16380      0.0000

Durbin-Watson stat   0.082816

Panel 4: Dependent Variable: D(Y)
Method: Least Squares
Sample: 1961:1 1994:4; Included observations: 136

Variable          Coefficient   Std. Error   t-Statistic   Prob.
C                 0.008400      0.001793     4.686264      0.0000

Exhibit 7.14 Industrial Production (Example 7.13)

Deterministic trend model for the logarithm of US industrial production (Panel 1) with fitted
values over 1961.1-1994.4 and with forecasts for 1995.1-1999.4 with 95% forecast intervals
(b) and with corresponding residuals (c). Stochastic trend model (Panel 4) with similar graphs
((e) and (f); the fitted values in (e) lag closely behind the actual values).
Exhibit 7.15 Industrial Production (Example 7.13)

Trend estimates obtained by EWMA ((a)-(b), with ML smoothing factor $1 - \theta = 0.97$ in (a)
and with $1 - \theta = 0.2$ in (b)) and by Holt-Winters ((c)-(d), with ML smoothing factors
$1 - \theta_1 = 1.00$ and $1 - \theta_2 = 0.02$ in (c) and with $1 - \theta_1 = 1 - \theta_2 = 0.2$ in (d)).


Exercises:  T:  7.5a-c;  E:  7.17b,  7.18e,  7.20c,  d. 


7.3.3  Unit  root  tests 
Formulation  of  the  testing  problem 

The  analysis  in  the  two  foregoing  sections  shows  that  it  is  of  importance  to 
model  the  trend  in  an  appropriate  way.  In  particular,  one  should  distinguish 
deterministic  from  stochastic  trends.  If  the  trend  is  deterministic,  then  the 
series  reverts  to  the  trend  line  in  the  long  run,  innovation  shocks  have  an 
effect  that  diminishes  over  time,  and  the  forecast  variance  is  constant  for  all 
horizons.  On  the  other  hand,  if  the  trend  is  stochastic,  then  the  series  does  not 
revert to a long-term trend line, innovation shocks have a permanent and
non-vanishing  effect,  and  the  forecast  variance  increases  for  larger  horizons. 
In  this  section  we  discuss  tests  for  the  nature  of  the  trend.  In  formulating 
the  testing  problem,  we  should  make  sure  that  the  models  under  the  null 
hypothesis  and  under  the  alternative  hypothesis  are  reasonable  competing 
alternatives. 

As  a  first  step  we  can  make  a  time  plot  of  the  series  to  see  if  it  has  any 
trending  pattern.  In  many  cases  it  has.  A  simple  test  on  the  nature  of  the  trend 
can  be  based  on  the  model  (7.21)  —  that  is, 


$$y_t = \alpha + \beta t + \phi y_{t-1} + \varepsilon_t,$$

where $\varepsilon_t$ is white noise. This model corresponds to a deterministic trend if
$-1 < \phi < 1$ and $\beta \neq 0$, and it corresponds to a stochastic trend if $\phi = 1$ and
$\beta = 0$. The case $\phi = 1$ and $\beta \neq 0$ is somewhat less relevant, as this corresponds
to a quadratic trend pattern that does not occur so much in practice.
As the test of parameter restrictions is much easier than that of parameter
inequalities, one usually takes as null hypothesis that the trend is stochastic
and as alternative that the trend is deterministic. By subtracting $y_{t-1}$ from
both sides of the above test equation, it can be rewritten as

$$\Delta y_t = \alpha + \beta t + \rho y_{t-1} + \varepsilon_t, \tag{7.24}$$

where $\rho = \phi - 1$. The null hypothesis of a stochastic trend and the alternative
hypothesis of a deterministic trend can be formulated in terms of the
following two parameter restrictions:

$H_0: \rho = 0$ and $\beta = 0$ (stochastic trend),

$H_1: (-2 <)\ \rho < 0$ and $\beta \neq 0$ (deterministic trend).

The case $\rho \le -2$, or equivalently $\phi \le -1$, is of little practical importance, so
that the relevant alternative situation of a (trend) stationary time series
corresponds to $\rho < 0$.

Dickey-Fuller  F-test 

The above two parameter restrictions can be tested by the usual F-test.
Because of the lagged regressor $y_{t-1}$, the equation has $n - 1$ effective observations
and $k = 3$ parameters under the alternative hypothesis. So the
F-statistic is given by

$$F = \frac{(e_R'e_R - e'e)/2}{e'e/(n - 4)}.$$

Here $e'e$ is the residual sum of squares in the model (7.24) without parameter
restrictions and $e_R'e_R$ is the residual sum of squares in the stochastic trend
model with $\beta = \rho = 0$, that is, in the random walk with drift. The null
hypothesis of a stochastic trend is rejected for large values of the F-statistic.
However, the distribution of this statistic is non-standard as it is not equal to
the conventional $F(2, n - 4)$ distribution, not even asymptotically. The
reason is that, under the null hypothesis, the series $y_t$ contains a unit root,
so that the regressor $y_{t-1}$ in the test equation is non-stationary. That is, the
stability condition of Section 4.1.2 (p. 193) is not satisfied as

$$\mathrm{plim}\left(\frac{1}{n}\sum_{t=2}^{n} y_{t-1}^2\right) = \infty$$

for series with a stochastic trend.

Critical values for the test were obtained by Dickey and Fuller. The 5 per
cent critical values for this test are given in Exhibit 7.16 (a). For large samples,
the 5 per cent critical value of the standard $F(2, n - 4)$ distribution is 3.00, but
in our trend testing problem the critical values are larger than 6. The 1 per
cent critical values range from 9.8 for $n = 50$ to 8.2 for large samples, and
these values are also around twice as much as the 1 per cent critical value of
the $F(2, n - 4)$ distribution that is 4.61 in large samples. Exhibit 7.16 also
contains critical values for other tests that will be explained below.
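
The F-statistic of this test follows from two OLS regressions. The Python sketch below is an illustration using NumPy least squares; the function names and the simulated example series are assumptions made here. The statistic must be compared with the Dickey-Fuller critical values of Exhibit 7.16 (a), such as 6.25 in large samples, and not with the conventional F-table.

```python
import numpy as np


def ols_rss(X, y):
    """Residual sum of squares of the OLS regression of y on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)


def dickey_fuller_f(y):
    """F-statistic for beta = rho = 0 in Delta y_t = alpha + beta*t + rho*y_{t-1} + eps_t."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    dy = np.diff(y)                              # Delta y_t for t = 2..n
    ones = np.ones(n - 1)
    trend = np.arange(2, n + 1, dtype=float)     # t = 2..n
    rss_u = ols_rss(np.column_stack([ones, trend, y[:-1]]), dy)   # unrestricted model (7.24)
    rss_r = ols_rss(ones[:, None], dy)           # restricted model: random walk with drift
    return ((rss_r - rss_u) / 2.0) / (rss_u / (n - 4))


# Example: a trend stationary series, for which the null should be rejected.
rng = np.random.default_rng(0)
n = 200
y_ts = np.zeros(n)
for i in range(1, n):
    y_ts[i] = 1.0 + 0.05 * (i + 1) + 0.5 * y_ts[i - 1] + rng.standard_normal()
print("F =", round(dickey_fuller_f(y_ts), 2), " (5% asymptotic critical value 6.25)")
```

The restricted residual sum of squares can never be smaller than the unrestricted one, so the statistic is non-negative by construction.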

Unit  root  test  and  Dickey-Fuller  t-distribution 

Instead of the above F-test, in practice one often tests the single restriction
that $\phi = 1$ against the alternative that $\phi < 1$. This is called a unit root test.
Then the null hypothesis of a stochastic trend against the alternative of a
deterministic trend corresponds to the one-sided test

$H_0: \rho = 0$ (stochastic trend),

$H_1: \rho < 0$ (no stochastic trend).

The test is based on the t-value of $\rho$ in the regression (7.24). This t-value is
denoted by $t(\rho)$. The null hypothesis of a stochastic trend is rejected if $t(\rho)$ is
significantly smaller than zero, that is, if it falls below the relevant (negative)
critical value. For the same reasons as before, $t(\rho)$ does not follow the
t-distribution, not even asymptotically. The distribution of $t(\rho)$ in the test
equation (7.24) depends on the value of $\beta$. If the DGP actually has $\rho = 0$ and
$\beta = 0$, which is the relevant case under the null hypothesis of a stochastic
trend, then the distribution of $t(\rho)$ in the test equation (7.24) is called
the Dickey-Fuller distribution. The 5 per cent critical values are given in
Exhibit 7.16 (a). Whereas the one-sided critical value of the conventional
t-distribution is around $-1.645$ in large samples, the Dickey-Fuller critical
(a)

DATA: TREND
Test equation: $\Delta y_t = \alpha + \beta t + \rho y_{t-1} + \varepsilon_t$

Sample size (n)   F-test: $H_0: \beta = \rho = 0$   t-test: $H_0: \rho = 0$, $H_1: \rho < 0$
                                                    (DGP has trend parameter $\beta = 0$)
50                6.73                              -3.49
100               6.49                              -3.45
500               6.30                              -3.42
$\infty$          6.25                              -3.41

The critical values apply for the DGP with $\beta = \rho = 0$ and do not depend on $\alpha$.
The Dickey-Fuller t-test corresponds to the last column.
The asymptotic test values (last row) are the ones that are used most often.

(b)

DATA: NO CLEAR TREND
Test equation: $\Delta y_t = \alpha + \rho y_{t-1} + \varepsilon_t$

Sample size (n)   F-test: $H_0: \alpha = \rho = 0$   t-test: $H_0: \rho = 0$, $H_1: \rho < 0$
                                                     (DGP has constant term $\alpha = 0$)
50                4.81                               -2.92
100               4.74                               -2.89
500               4.65                               -2.86
$\infty$          4.60                               -2.86

The critical values apply for the DGP with $\alpha = \rho = 0$.
The Dickey-Fuller t-test corresponds to the last column.
The asymptotic test values (last row) are the ones that are used most often.

Exhibit 7.16 Unit root tests

Critical values (for 5% significance level) of unit root tests for data with a clear trend direction
(a) and for data without a clear trend direction (b). The critical values are obtained by
simulation.


value is around $-3.41$ in large samples, that is, it is about twice as large
again. Simulation evidence of this large shift in the distribution is left as an
exercise (see Exercise 7.14). It has been shown that $t(\rho)$ in the test equation
(7.24) has asymptotically the standard normal distribution if $\rho = 0$ and
$\beta \neq 0$. However, this situation is not so relevant in unit root testing because
under the null hypothesis (with $\phi = 1$ and $\beta \neq 0$) the series would contain a
quadratic trend.

Test  for  data  without  a  clear  overall  trend  direction 

If the data show prolonged upward and downward patterns but no clear
overall trend direction, then the deterministic trend term can be dropped
from (7.24), so that the test equation simplifies to

$$\Delta y_t = \alpha + \rho y_{t-1} + \varepsilon_t.$$
In Section 7.3.1 we showed that, for the stochastic trend model with $\rho = 0$,
the parameter $\alpha$ introduces a deterministic trend in the series. As we started
from the assumption that the series has no clear overall trend direction, the
relevant parameter restrictions for a stochastic trend are that $\rho = 0$ and
$\alpha = 0$. The alternative of a stationary process corresponds to $-2 < \rho < 0$,
and in this case $\alpha$ is included to model the possibly non-zero mean of the
process. So the testing problem becomes

$H_0: \rho = 0$ and $\alpha = 0$ (stochastic trend),

$H_1: (-2 <)\ \rho < 0$ and $\alpha \neq 0$ (no trend).

This can again be tested by the F-test. The distribution is again non-standard,
and the critical values differ from the ones obtained for the test (7.24) where a
deterministic trend is included. The 5 per cent critical values are in Exhibit
7.16 (b) and range from around 4.8 for $n = 50$ to 4.6 in large samples. This is
approximately midway between the conventional F-values ($\approx 3$) and the F-values
($\approx 6$) that apply for the test equation (7.24) with trend term included.

Also in this case one often uses a t-test instead of the F-test, so that
$H_0: \rho = 0$ is tested against $H_1: \rho < 0$. Under the null hypothesis of a stochastic
trend ($\rho = \alpha = 0$), the t-value of $\rho$ in the test equation
$\Delta y_t = \alpha + \rho y_{t-1} + \varepsilon_t$ again follows a non-standard distribution. The relevant
Dickey-Fuller distribution differs from the one that applies for the test
equation (7.24), as the deterministic trend term ($\beta t$) is now omitted. The 5
per cent critical values are now around $-2.9$, well below the conventional
value of $-1.645$ (see Exhibit 7.16 (b)). We mention that for $\rho = 0$ and $\alpha \neq 0$
the t-test of $\rho$ asymptotically has the standard normal distribution. However,
the case $\rho = 0$ and $\alpha \neq 0$ is not so relevant here, as in this case the series
contains a clear trend direction so that (7.24) with the deterministic trend
term ($\beta t$) included would be the correct test equation.

Choice  of  appropriate  test  equation 

In practice it is sometimes not so clear whether the time series has a clear
overall trend direction or not. Then the question arises whether the trend
term ($\beta t$) in (7.24) should be included in the test regression or not. A possible
method is to start with this term included and to drop it if it is not significant.
However, if the series has a stochastic trend so that $\rho = 0$ in (7.24), then the
t-statistic of $\beta$ in (7.24) (with a constant, the deterministic trend $t$ and $y_{t-1}$
included as regressors) does not follow the standard t-distribution. The (two-sided)
5 per cent critical value for $\beta$ is around 3.1 instead of the conventional
value of 2.0.

In practice, often the best way to proceed is to plot the data and to exclude
the trend term ($\beta t$) only if there is no overall upward or downward trend.
In particular, one should make sure not to exclude the trend term if the series
has a clear direction, as otherwise the alternative hypothesis of the test
corresponds to a stationary time series, so that the test has little chance of
rejecting the null hypothesis of a stochastic trend. Simulation evidence of this
is left as an exercise (see Exercise 7.14). Sometimes, for time series without
clear trend direction that move around the value zero, the test equation is
simplified even further by excluding both the constant and the trend so that
the test regression becomes


$$\Delta y_t = \rho y_{t-1} + \varepsilon_t.$$

The null hypothesis of a stochastic trend corresponds to $\rho = 0$ and the
alternative of a stationary process to $-2 < \rho < 0$. This test equation makes
sense only if the series has no clear trend direction and moves around a mean
level of zero. The reason is that, under the alternative hypothesis, the series
has mean zero. The (one-sided) 5 per cent critical value is around -1.95
(instead of the conventional value of -1.645).
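The non-standard critical value can be made plausible by simulation. The sketch below is only an illustration under assumed settings (sample size 200, 2000 replications, standard normal errors, a fixed seed; the function names are hypothetical): it generates random walks, computes the t-value of $\rho$ in the test equation without constant and trend, and reports the empirical 5 per cent quantile, which should lie close to -1.95 rather than -1.645.

```python
import numpy as np

def df_t_statistic(y):
    """t-value of rho in the regression dy_t = rho * y_{t-1} + e_t (no constant, no trend)."""
    dy = np.diff(y)
    ylag = y[:-1]
    rho_hat = (ylag @ dy) / (ylag @ ylag)
    resid = dy - rho_hat * ylag
    s2 = (resid @ resid) / (len(dy) - 1)       # residual variance, one estimated parameter
    return rho_hat / np.sqrt(s2 / (ylag @ ylag))

def simulate_df_quantile(n=200, replications=2000, level=0.05, seed=0):
    """Empirical quantile of the t-statistic under the random-walk null (rho = 0)."""
    rng = np.random.default_rng(seed)
    stats = np.empty(replications)
    for r in range(replications):
        y = np.cumsum(rng.standard_normal(n))  # random walk without drift
        stats[r] = df_t_statistic(y)
    return np.quantile(stats, level)

critical_value = simulate_df_quantile()
print(round(critical_value, 2))
```

More replications give a more stable estimate of the quantile; the settings above are chosen only to keep the run short.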

Overview  of  unit  root  testing 

We summarize the above results. In most cases where one is interested in
investigating the nature of the trend in a series, the relevant test regression
is (7.24) or its generalization (7.25), which is discussed below. The null
hypothesis of a stochastic trend is tested by the F-test on $\rho = \beta = 0$.
As an alternative, one can apply the Dickey-Fuller t-test on the single
restriction that $\rho = 0$ against the alternative that $\rho < 0$. The null
hypothesis of a stochastic trend is rejected in favour of the alternative of a
deterministic trend if the F-test takes large values, or if the t-test takes
large negative values. The tests do not follow the conventional F- and
t-distributions. In large enough samples ($n > 100$) the 5 per cent critical
values of the F-test are roughly around 6.5 and those of the t-test around -3.5.
As a rule of thumb, the presence of a stochastic trend is rejected for $F > 6.5$
or for $t < -3.5$. Critical values are given in Exhibit 7.16 (a).

For time series with prolonged up- and downswings but without clear
trend direction, the trend term ($\beta t$) can be dropped from the test
equation (7.24). The null hypothesis of a stochastic trend can then be tested by
the F-test on $\rho = \alpha = 0$ or by the one-sided t-test on $\rho = 0$
against $\rho < 0$. The relevant critical values of these tests are in Exhibit
7.16 (b), with values (for $n > 100$) of roughly 4.7 for the F-test and -2.9 for
the t-test.
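The two versions of the test can be sketched in a few lines of code. The following is a minimal illustration, not a replacement for the tabulated critical values of Exhibit 7.16: the function names are hypothetical, the statistics are plain OLS quantities, and the decision rule simply hard-codes the rule-of-thumb values quoted above for $n > 100$.

```python
import numpy as np

def dickey_fuller_stats(y, trend=True):
    """OLS t-value of rho and F-value of the joint restriction in the DF test equation.

    trend=True : dy_t = alpha + beta*t + rho*y_{t-1} + e_t, F-test on rho = beta = 0
    trend=False: dy_t = alpha + rho*y_{t-1} + e_t,          F-test on rho = alpha = 0
    """
    y = np.asarray(y, dtype=float)
    dy, ylag = np.diff(y), y[:-1]
    n = len(dy)
    columns = [ylag, np.ones(n)]
    if trend:
        columns.append(np.arange(1, n + 1, dtype=float))
    X = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(X, dy, rcond=None)
    resid = dy - X @ coef
    ssr_u = resid @ resid                       # unrestricted sum of squared residuals
    dof = n - X.shape[1]
    cov = ssr_u / dof * np.linalg.inv(X.T @ X)
    t_rho = coef[0] / np.sqrt(cov[0, 0])
    # Restricted model under the null: constant only (trend=True) or no regressors at all.
    ssr_r = np.sum((dy - dy.mean()) ** 2) if trend else dy @ dy
    f_stat = ((ssr_r - ssr_u) / 2) / (ssr_u / dof)
    return t_rho, f_stat

def reject_stochastic_trend(t_rho, f_stat, trend=True):
    """Rule-of-thumb 5 per cent decision for n > 100, using the values quoted in the text."""
    t_crit, f_crit = (-3.5, 6.5) if trend else (-2.9, 4.7)
    return {"t-test": bool(t_rho < t_crit), "F-test": bool(f_stat > f_crit)}

# Illustration on an artificial trend-stationary series (white noise around a linear trend).
rng = np.random.default_rng(1)
t = np.arange(300)
y_demo = 1.0 + 0.05 * t + rng.standard_normal(300)
t_demo, f_demo = dickey_fuller_stats(y_demo, trend=True)
decision = reject_stochastic_trend(t_demo, f_demo, trend=True)
print(t_demo, f_demo, decision)
```

For the artificial series both tests should reject the stochastic trend, as the deviations from the trend line are white noise.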

Phillips-Perron  test 

The above tests are valid under the assumption that the error terms
$\varepsilon_t$ in the relevant test equation are normally distributed white
noise. In practice, time series are often also characterized by short-term
fluctuations in the sense that the detrended series is correlated over time. The
above models neglect this, so that the residuals will be serially correlated and
the critical values will not be valid. For the Dickey-Fuller t-test, we can
apply a Newey-West correction for serial correlation to compute the standard
error of the estimated parameter $\rho$. This correction is based on the GMM
method, as was discussed in Section 5.5.2 (p. 359-60). If this correction is
applied, the Dickey-Fuller critical values remain valid asymptotically. The
t-test based on the Newey-West standard error of $\rho$ is called the
Phillips-Perron test.
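As an illustration of the idea, the sketch below computes Newey-West standard errors for an OLS regression and uses them for the t-value of $\rho$ in the test equation with constant and trend. It is a sketch under stated assumptions: Bartlett weights with a user-chosen number of lags are assumed, the function name is hypothetical, and packaged Phillips-Perron routines may compute the statistic differently in detail.

```python
import numpy as np

def newey_west_se(X, resid, lags):
    """Newey-West (HAC) standard errors of OLS coefficients, assuming Bartlett weights."""
    XtX_inv = np.linalg.inv(X.T @ X)
    u = X * resid[:, None]                  # rows x_t * e_t
    S = u.T @ u                             # lag-0 term (White's estimator if lags = 0)
    for j in range(1, lags + 1):
        weight = 1.0 - j / (lags + 1.0)     # Bartlett weight, declines linearly in j
        G = u[j:].T @ u[:-j]                # sum over t of (x_t e_t)(x_{t-j} e_{t-j})'
        S += weight * (G + G.T)
    cov = XtX_inv @ S @ XtX_inv
    return np.sqrt(np.diag(cov))

# Illustration on a simulated random walk: test equation with constant and trend.
rng = np.random.default_rng(0)
y = np.cumsum(rng.standard_normal(200))
dy, ylag = np.diff(y), y[:-1]
n = len(dy)
X = np.column_stack([ylag, np.ones(n), np.arange(1, n + 1)])
coef, *_ = np.linalg.lstsq(X, dy, rcond=None)
resid = dy - X @ coef
t_hac = coef[0] / newey_west_se(X, resid, lags=4)[0]
print(t_hac)
```

The resulting t-value is compared with the Dickey-Fuller critical values, not with those of the standard normal or t-distribution.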


Derivation  of  augmented  Dickey-Fuller  test  equation 

An alternative method is to model the short-run correlations by including lagged
values of $y_t$ in the test equation so that

$$\phi(L) y_t = \alpha + \beta t + \varepsilon_t,$$

where the AR polynomial $\phi(z) = 1 - \phi_1 z - \cdots - \phi_p z^p$ has degree
$p$. We assume that the series $y_t$ is either integrated of order 1 or trend
stationary. The null hypothesis of a stochastic trend corresponds to the case
where $\beta = 0$ and $y_t$ is integrated of order 1. In this case, the AR
polynomial $\phi(z)$ should have a unit root, so that $\phi(1) = 0$. Then the
polynomial can be factorized as $\phi(z) = (1 - z)\phi^*(z)$, so that $y_t$ is an
ARIMA process. The alternative is that $y_t$ is trend stationary, that is,
$\beta \neq 0$ and the AR polynomial is stationary, so that all the roots of
$\phi(z) = 0$ lie outside the unit circle. As $\phi(0) = 1$, the requirement that
$\phi(z) = 0$ has no solutions for $|z| \leq 1$ implies that $\phi(1) > 0$ in
this case. Therefore the testing problem can be formulated as follows in terms
of the parameters $\beta$ and $\phi(1) = 1 - \sum_{k=1}^{p} \phi_k$:

$$H_0: \phi(1) = 0 \text{ and } \beta = 0 \quad \text{(stochastic trend)},$$

$$H_1: \phi(1) > 0 \text{ and } \beta \neq 0 \quad \text{(deterministic trend)}.$$

For the case of an AR polynomial of order $p = 1$ we have $\phi(1) = 1 - \phi$,
so that $\phi(1) = 0$ corresponds to $\phi = 1$ and $\phi(1) > 0$ to $\phi < 1$,
which is the case discussed before in terms of the test equation (7.24).

The following technical results are helpful to write the above testing
problem for AR($p$) models in a more convenient form. Define the polynomial
$\psi(z) = \phi(z) - \phi(1) z$; then $\psi(1) = 0$, so that $\psi(z)$ can be
factorized as $\psi(z) = (1 - z)\rho(z)$ for some polynomial
$\rho(z) = 1 - \rho_1 z - \cdots - \rho_{p-1} z^{p-1}$ of degree $(p - 1)$. Now
define $\rho = -\phi(1)$ and rewrite the polynomial $\phi(z)$ as

$$\phi(z) = \psi(z) + \phi(1) z = \phi(1) z + (1 - z)\rho(z)
= -\rho z + (1 - z) - (1 - z) \sum_{k=1}^{p-1} \rho_k z^k.$$

If we use this result, the test equation
$\phi(L) y_t = \alpha + \beta t + \varepsilon_t$ can be written as follows:

$$\Delta y_t = \alpha + \beta t + \rho y_{t-1} + \rho_1 \Delta y_{t-1} + \cdots
+ \rho_{p-1} \Delta y_{t-p+1} + \varepsilon_t. \qquad (7.25)$$
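The rewriting of the AR polynomial can be checked numerically. The sketch below uses illustrative AR coefficients (an assumption, not taken from the text) and hypothetical function names. It relies on the standard closed-form solution $\rho = \sum_{j=1}^{p} \phi_j - 1$ and $\rho_k = -\sum_{j=k+1}^{p} \phi_j$, which follows from matching coefficients in the factorization above, and verifies that both forms of the polynomial agree at a few points.

```python
def adf_parameters(phi):
    """Map AR coefficients phi_1..phi_p to rho = -phi(1) and rho_1..rho_{p-1}."""
    rho = sum(phi) - 1.0
    rho_k = [-sum(phi[k:]) for k in range(1, len(phi))]   # rho_k = -(phi_{k+1} + ... + phi_p)
    return rho, rho_k

def phi_poly(phi, z):
    """phi(z) = 1 - phi_1 z - ... - phi_p z^p."""
    return 1.0 - sum(c * z ** (j + 1) for j, c in enumerate(phi))

def rewritten_poly(rho, rho_k, z):
    """-rho z + (1 - z) rho(z), with rho(z) = 1 - rho_1 z - ... - rho_{p-1} z^{p-1}."""
    rho_z = 1.0 - sum(c * z ** (k + 1) for k, c in enumerate(rho_k))
    return -rho * z + (1.0 - z) * rho_z

phi = [0.5, 0.3, -0.1]                      # illustrative AR(3) coefficients
rho, rho_k = adf_parameters(phi)
gaps = [abs(phi_poly(phi, z) - rewritten_poly(rho, rho_k, z)) for z in (0.3, -1.2, 2.0)]
print(rho, rho_k, max(gaps))
```

For these coefficients $\phi(1) = 0.3 > 0$, so that $\rho = -0.3 < 0$, which corresponds to the stationary alternative.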


Augmented  Dickey-Fuller  test 

As $\rho = -\phi(1)$, the testing problem can be formulated as follows in terms
of the augmented Dickey-Fuller test equation (7.25):

$$H_0: \rho = 0 \text{ and } \beta = 0 \quad \text{(stochastic trend)},$$

$$H_1: \rho < 0 \text{ and } \beta \neq 0 \quad \text{(deterministic trend)}.$$

The test can be performed by the F-test, or by the t-test on $\rho$. This is
called the augmented Dickey-Fuller (ADF) test. The test equation is simply
obtained from the basic test equation (7.24) by adding lagged values of
$\Delta y_t$ as additional regressors. Note that, under the null hypothesis of a
stochastic trend, the series $y_t$ is integrated of order 1, so that the added
regressors are all stationary. The asymptotic critical values (for
$n \to \infty$) of the ADF test are the same as the ones for the Dickey-Fuller
test reported in Exhibit 7.16. Although the finite sample critical values are
different, those of Exhibit 7.16 are still approximately valid provided that the
lag length $p$ is relatively small compared to the sample size $n$.

The lag order $p$ in the ADF test equation can be selected, for instance, by
starting with a large value for $p$ and then sequentially reducing the order by
testing for the significance of the coefficient $\rho_{p-1}$ of the largest lag.
As the regressors $\Delta y_{t-k}$ are stationary, it can be shown that the
t-tests for these coefficients follow the standard t-distribution. Another
method to select the lag order $p$ is to start with equation (7.24) and then to
increase the order $p$ until the residuals no longer show significant
autocorrelation.
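The first selection method can be sketched as follows. This is a minimal illustration under assumptions that are not in the text: the function names are hypothetical, the maximal number of lagged differences is set at 8, and a lag is kept if its absolute t-value is at least 2. Note that `lags` counts the lagged differences, so that it corresponds to $p - 1$ in (7.25).

```python
import numpy as np

def adf_regression(y, lags):
    """OLS for dy_t = alpha + beta*t + rho*y_{t-1} + sum_k rho_k*dy_{t-k} + e_t.

    Returns coefficients and t-values ordered as [rho, rho_1, ..., rho_lags, alpha, beta].
    """
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)                                 # dy[i] = y[i+1] - y[i]
    rows = np.arange(lags, len(dy))                 # observations with all lags available
    columns = [y[rows]]                             # lagged level belonging to dy[rows]
    columns += [dy[rows - k] for k in range(1, lags + 1)]
    columns += [np.ones(len(rows)), rows + 1.0]     # constant and deterministic trend
    X = np.column_stack(columns)
    target = dy[rows]
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ coef
    s2 = (resid @ resid) / (len(rows) - X.shape[1])
    se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
    return coef, coef / se

def select_lags(y, max_lags=8, t_crit=2.0):
    """General-to-specific: drop the largest lag while its |t-value| is below t_crit."""
    for lags in range(max_lags, 0, -1):
        _, t_values = adf_regression(y, lags)
        if abs(t_values[lags]) >= t_crit:           # t-value of the largest included lag
            return lags
    return 0

# Illustration: a series whose first differences follow an AR(1) process with coefficient 0.5.
rng = np.random.default_rng(2)
e = rng.standard_normal(500)
d = np.zeros(500)
for i in range(1, 500):
    d[i] = 0.5 * d[i - 1] + e[i]
y_sim = np.cumsum(d)
chosen = select_lags(y_sim)
coef, t_values = adf_regression(y_sim, chosen)
print(chosen, t_values[0])
```

The t-value of $\rho$ (the first element of `t_values`) is then compared with the Dickey-Fuller critical values, whereas the t-values of the lagged differences are compared with conventional ones.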

Testing  for  integration  of  order  2 

The above results hold true for testing the null hypothesis that the process has
a single unit root against the alternative that it is stationary around a
deterministic trend. If the process $\Delta y_t$ is possibly non-stationary,
that is, if $y_t$ is possibly integrated of order 2, then one can proceed as
follows. First test the null hypothesis that $y_t$ is integrated of order 2
against the alternative that it is integrated of order 1. This can be tested by
considering the differenced series $z_t = \Delta y_t$, which is integrated of
order 1 under the null hypothesis and (trend) stationary under the alternative.
For instance, one can apply the ADF test equation (7.25) for the series $z_t$.
If the null hypothesis is rejected, that is, if $z_t$ is trend stationary, then
$y_t$ is integrated of order at most 1. As a second step, one can test whether
$y_t$ is integrated of order 1 against the alternative that it is (trend)
stationary by applying the ADF test for the series $y_t$.

Remark  on  critical  values  in  unit  root  tests 

Exhibit 7.16 shows that critical values of unit root tests depend on the
inclusion of deterministic components (constant, trend) in the equation.
The critical values also depend on other possible deterministic components,
such as breaks in the level $\alpha$ or in the trend slope $\beta$ or the
presence of seasonal components. Therefore one should first make sure that such
components are modelled in an appropriate way before unit root tests are
applied, as otherwise the test outcomes may be misleading. The effect of breaks
is further discussed in Section 7.4.1.

Example 7.14: Industrial Production (continued)

We continue our analysis of the series $y_t$ consisting of the logarithm of US
quarterly industrial production. We will discuss (i) the data, (ii) the results
of Dickey-Fuller tests, and (iii) the results of Phillips-Perron and augmented
Dickey-Fuller tests.

(i)  The  data 

Exhibit  7.1  (a)  in  Example  7.1  shows  that  this  series  is  characterized  by  an 
upward  trend.  In  Example  7.13  we  considered  different  trend  models,  and 
now  we  will  test  for  the  nature  of  the  trend  of  this  series.  Because  of  the  clear 
overall  trend  direction,  we  should  always  include  a  deterministic  trend  term 
in  testing  the  null  hypothesis  of  a  stochastic  trend.  Otherwise,  if  this  trend 
term  is  omitted,  the  alternative  hypothesis  would  correspond  to  a  stationary 
process  and  there  would  be  no  chance  of  rejecting  the  null  hypothesis  in 
favour  of  the  alternative.  We  use  quarterly  data  over  the  period  1961-94,  so 
that  there  are  n  =  136  observations. 

(ii)  Dickey-Fuller  tests 

We start with the basic test equation (7.24). The results are in Panels 1 and 2
of Exhibit 7.17. The test values are F = 5.46 (which is smaller than the 5 per
cent critical value of around 6.45) and t = -2.80 (which is larger than the
critical value of around -3.44). Therefore the null hypothesis of a unit root is
not rejected (at 5 per cent significance). However, this test equation is not
well specified, as the residuals show serial correlation. Panel 3 of Exhibit 7.17
shows the values of the first five S(P)ACF of the OLS residuals of the
regression (7.24). For instance, the correlations at lags four and five are
significant, indicating the possible presence of seasonal effects. Such effects
could well be
present for quarterly production figures.

Panel 1: ADF Test Statistic   -2.797508     5% Critical Value   -3.44331
Dickey-Fuller Test Equation; Dependent Variable: D(Y)
Sample: 1961:1 1994:4; Included observations: 136

Variable           Coefficient   Std. Error   t-Statistic
Y(-1)                -0.065763     0.023508     -2.797508
C                     0.261410     0.088632      2.949402
@TREND(1961:1)        0.000397     0.000175      2.263446

Panel 2: F-test on Y(-1) and TREND
Null Hypothesis: C(1)=0, C(3)=0
F-statistic   5.459897

Panel 3: S(P)ACF of residuals of Dickey-Fuller regression
Sample: 1961:1 1994:4; Included observations: 136

Lag      SACF     SPACF    Q-Stat     Prob
1       0.086     0.086    1.0180    0.313
2       0.196     0.190    6.3727    0.041
3      -0.123    -0.160    8.5222    0.036
4       0.310     0.318    22.176    0.000
5      -0.298    -0.379    34.894    0.000

Panel 4: Phillips-Perron Test   -2.897757     5% Critical Value   -3.4433

Panel 5: ADF Test Statistic   -2.879186     5% Critical Value   -3.44331
Augmented Dickey-Fuller Test Equation; Dependent Variable: D(Y)
Sample: 1961:1 1994:4; Included observations: 136

Variable           Coefficient   Std. Error   t-Statistic
Y(-1)                -0.062996     0.021880     -2.879186
D(Y(-1))              0.267290     0.080354      3.326419
D(Y(-2))              0.111804     0.077854      1.436074
D(Y(-3))             -0.150127     0.076845     -1.953643
D(Y(-4))              0.345637     0.076304      4.529756
D(Y(-5))             -0.289789     0.079154     -3.661092
C                     0.247258     0.081971      3.016418
@TREND(1961:1)        0.000397     0.000166      2.398142

Panel 6: F-test on Y(-1) and TREND
Null Hypothesis: C(1)=0, C(8)=0
F-statistic   5.493062

Panel 7: S(P)ACF of residuals of Augmented Dickey-Fuller regression
Sample: 1961:1 1994:4; Included observations: 136

Lag      SACF     SPACF    Q-Stat     Prob
1       0.001     0.001    6.E-05    0.994
2      -0.104    -0.104    1.5259    0.466
3       0.049     0.049    1.8584    0.602
4       0.013     0.002    1.8826    0.757
5      -0.015    -0.004    1.9127    0.861

Exhibit 7.17 Industrial Production (Example 7.14)

Tests on the nature of the trend in the series of US industrial production (in
logarithms): Dickey-Fuller t-test (Panel 1) and F-test (Panel 2), S(P)ACF of
residuals of test equation (Panel 3), Phillips-Perron test (Panel 4), ADF t-test
(Panel 5) and F-test (Panel 6), and S(P)ACF of residuals of ADF test equation
(Panel 7).

(iii) Phillips-Perron and augmented Dickey-Fuller tests

The foregoing results show that we should correct for the short-run
correlations that are present in the time series. Panel 4 of Exhibit 7.17 shows
the result of the Phillips-Perron test. The computed t-value is slightly lower
(-2.90 as compared to an OLS t-value of -2.80). Still, this is well above
the critical value of -3.44, so that the null hypothesis of a stochastic trend
is not rejected. Panels 5 and 6 of Exhibit 7.17 show the result of the
augmented Dickey-Fuller test with five lagged terms of $\Delta y_t$ included in
the test equation. The relevant t-value is now -2.88, and again we cannot reject
the presence of a unit root. The F-test has value F = 5.49, which is below the
5 per cent critical value (6.45), so the presence of a unit root is also not
rejected by the F-test. Panel 7 of Exhibit 7.17 contains the S(P)ACF of the
residuals of this ADF test equation. These residuals no longer contain any
significant correlation, so that this test equation is well specified.
The overall conclusion is that the logarithmic series of industrial production
contains a unit root, that is, the trend is stochastic. The modelling of the
seasonal components of this series is further discussed in Example 7.16 in
the next section.


Example  7.15:  Dow-Jones  Index  (continued) 

As a second example we consider the series $y_t$ consisting of the logarithm of
the daily Dow-Jones index. This series is shown in Exhibit 7.2 (b) in Example
7.2. The series contains a clear upward trend and consists of n = 2528
observations. We will (i) test for the presence of a unit root, and (ii) test for
the presence of two unit roots.

(i)  Test  for  the  presence  of  a  unit  root 

We use the ADF test equation (7.25) with five lagged terms. This is because
the series consists of daily data, so that the five lagged terms can pick up
possible weekly effects. Because of the large number of observations we
can use the asymptotic critical values (for $n = \infty$) in Exhibit 7.16. The
results in Panels 1 and 2 of Exhibit 7.18 show that the null hypothesis of a
stochastic trend cannot be rejected, as F = 4.20 < 6.25 and t = -2.54 >
-3.41. Panel 3 shows that the residuals of the test equation are not serially
correlated.

(ii)  Test  for  the  presence  of  two  unit  roots 

Exhibit 7.2 (c) shows that the series of first differences $\Delta y_t$ does not
display a clear trend direction. We can test whether the series $y_t$ has two
unit roots by testing whether the series $\Delta y_t$ has a unit root. The ADF
test equation (7.24) in Panels 4 and 5 of Exhibit 7.18 gives test values
F = 1189 and t = -48.77. So the presence of a second unit root is clearly
rejected. The S(P)ACF of the residuals of this test equation in Panel 6 of
Exhibit 7.18 indicates no serial correlation, which justifies the use of the
simple equation (7.24).

Panel 1: ADF Test Statistic   -2.539298     5% Critical Value   -3.4142
Augmented Dickey-Fuller Test Equation; Dep. Variable: D(LOGDJ)
Sample(adjusted) 7 2528; Included obs 2522 after adjusting endpoints

Variable           Coefficient   Std. Error   t-Statistic
LOGDJ(-1)            -0.004428     0.001744     -2.539298
D(LOGDJ(-1))          0.030979     0.019923      1.554888
D(LOGDJ(-2))         -0.016979     0.019930     -0.851933
D(LOGDJ(-3))         -0.044691     0.019907     -2.245021
D(LOGDJ(-4))         -0.006859     0.019923     -0.344270
D(LOGDJ(-5))         -0.013512     0.019916     -0.678443
C                     0.034227     0.013418      2.550807
@TREND(1)             3.04E-06     1.09E-06      2.787442

Panel 2: F-test on LOGDJ(-1) and TREND
Null Hypothesis: C(1)=0, C(8)=0
F-statistic   4.201230

Panel 3: S(P)ACF of residuals of ADF regression for LOGDJ
Sample: 7 2528; Included observations: 2522

Lag      SACF     SPACF    Q-Stat     Prob
1       0.000     0.000    0.0002    0.988
2       0.000     0.000    0.0004    1.000
3      -0.002    -0.002    0.0148    1.000
4      -0.001    -0.001    0.0167    1.000
5      -0.001    -0.001    0.0174    1.000

Panel 4: ADF Test Statistic   -48.76587     5% Critical Value   -3.4142
Augmented Dickey-Fuller Test Equation; Dep Variable: D(DLOGDJ)
Sample(adjusted) 3 2528; Included obs 2526 after adjusting endpoints

Variable           Coefficient   Std. Error   t-Statistic
DLOGDJ(-1)           -0.970458     0.019900     -48.76587
C                     0.000128     0.000355      0.361549
@TREND(1)             3.27E-07     2.43E-07      1.342188

Panel 5: F-test on DLOGDJ(-1) and TREND
Null Hypothesis: C(1)=0, C(3)=0
F-statistic   1189.055

Panel 6: S(P)ACF of residuals of ADF regression for D(LOGDJ)
Sample: 3 2528; Included observations: 2526

Lag      SACF     SPACF    Q-Stat     Prob
1       0.001     0.001    0.0008    0.978
2      -0.018    -0.018    0.7986    0.671
3      -0.046    -0.046    6.2523    0.100
4      -0.009    -0.010    6.4792    0.166
5      -0.013    -0.015    6.9391    0.225

Exhibit 7.18 Dow-Jones Index (Example 7.15)

Augmented Dickey-Fuller test on a unit root in the logarithms of the Dow-Jones (LOGDJ),
t-test (Panel 1), F-test (Panel 2), S(P)ACF of residuals of the ADF test equation (Panel 3),
and Dickey-Fuller test on a unit root in the series of first differences (DLOGDJ), t-test
(Panel 4), F-test (Panel 5), S(P)ACF of residuals of the DF test equation (Panel 6).

Exercises: S: 7.12f, 7.14a-d; E: 7.17c, d, 7.20a, b, 7.21b, 7.23a, d, 7.24a.


7.3.4  Seasonality 

Time  series  components 

When time series are measured, for instance, every month or every quarter,
they may contain pronounced seasonal variation. The seasonal component in
a time series refers to patterns that are repeated over a one-year period and
that average out in the long run. The patterns that do not average out are
included in the constant and the trend components of the model. Whereas the
trend is of dominant importance in long-term forecasting, the seasonal
component is very important in short-term forecasting as it is often the
main source of short-run fluctuations. Seasonal effects may be detected
from plots of the time series, and also from plots of the seasonal series that
consist of the observations in the same month or quarter over the different
years. The autocorrelations of seasonal time series often show positive peaks
at the seasonal lag and its multiples, that is, at lags 12, 24, 36 (and so on)
for monthly series and at lags 4, 8, 12 (and so on) for quarterly series.

If trend and seasonal components are additive, then the time series may be
decomposed as

$$y_t = T_t + S_t + R_t.$$

Here $T_t$ denotes the trend component and $S_t$ the seasonal component. The
component $R_t$ stands for a stationary process that consists of transient
deviations from the trend and seasonal components. If the effects are
multiplicative, this is modelled as $y_t = T_t S_t R_t$, that is, the three
components multiplied with each other produce the observed series. The
multiplicative model can easily be transformed to an additive model by taking
logarithms, so that $\log(y_t) = \log(T_t) + \log(S_t) + \log(R_t)$. In this
section we will therefore discuss only additive models.

Decomposition  of  time  series  and  the  Census  X-12  method 

Stated in general terms, for series with additive components the trend $T_t$ can
be obtained by long-term moving averages of the series $y_t$. After the trend has
been estimated, the seasonal components can be obtained by averaging
the detrended values $(y_t - T_t)$ that pertain to the same period of the year
(the same month or the same quarter). Finally, the stationary remainder
$(y_t - T_t - S_t)$ can be modelled by an ARMA process to take care of
short-term deviations of the series $y_t$ from the trend and seasonal patterns.
Forecasts of the series $y_t$ can then be computed by adding the forecasts of the
trend, the seasonal component, and the stationary part. A well-known
method of this type is the so-called Census X-12 method to construct the
seasonal component $S_t$ and the corresponding seasonally adjusted time series
$(y_t - S_t)$. For quarterly data with additive trend and seasonal components,
the idea is as follows. A simple first estimate of the trend is given by the
centred yearly average
$T_t = \frac{1}{4}\left(\frac{1}{2} y_{t-2} + y_{t-1} + y_t + y_{t+1} + \frac{1}{2} y_{t+2}\right)$.
The seasonal index $Q_j$ of quarter $j$ is then defined as the sample average of
all values of the trend-adjusted series $(y_t - T_t)$ that fall in the $j$th
quarter, $j = 1, \ldots, 4$. The seasonal component is defined by
$S_t = Q_j - \bar{Q}$, where $j$ is the quarter of observation $t$ and $\bar{Q}$
is the average of the four seasonal indices. The value of $\bar{Q}$ is
subtracted in computing the seasonal component $S_t$ because this component then
sums up to zero over a year. The seasonally adjusted series is
defined by $(y_t - S_t)$. This adjusted series may be used as a starting point in
a second round to construct a new (longer term) estimate of the trend and
corresponding new estimates of the seasonal components. This method is
much used in practice, with modifications to take care of many kinds of
possible special properties of observed time series. A disadvantage of this
method is that it does not specify a statistical model, so that it is not possible
to apply statistical tests on the outcomes.
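The first round of this procedure for quarterly data can be sketched as follows. This is only an illustration of the three steps described above (centred yearly average, seasonal indices, centring of the indices), not the full Census X-12 method with its many refinements; the function names and the artificial series are assumptions.

```python
def centred_yearly_average(y):
    """Trend estimate T_t for quarterly data; None for the first and last two observations."""
    n = len(y)
    trend = [None] * n
    for t in range(2, n - 2):
        trend[t] = (0.5 * y[t - 2] + y[t - 1] + y[t] + y[t + 1] + 0.5 * y[t + 2]) / 4.0
    return trend

def seasonal_component(y):
    """S_t = Q_j - Qbar, where Q_j averages the detrended values of quarter j."""
    trend = centred_yearly_average(y)
    sums, counts = [0.0] * 4, [0] * 4
    for t, (obs, tr) in enumerate(zip(y, trend)):
        if tr is not None:
            sums[t % 4] += obs - tr          # t % 4 identifies the quarter of observation t
            counts[t % 4] += 1
    q = [s / c for s, c in zip(sums, counts)]
    q_bar = sum(q) / 4.0
    return [q[t % 4] - q_bar for t in range(len(y))]

# Artificial quarterly series: linear trend plus a fixed seasonal pattern that sums to zero.
pattern = [1.0, -2.0, 3.0, -2.0]
y = [2.0 + 0.5 * t + pattern[t % 4] for t in range(24)]
s = seasonal_component(y)
adjusted = [obs - st for obs, st in zip(y, s)]   # seasonally adjusted series y_t - S_t
print(s[:4])
```

For this noise-free series the centred average reproduces the linear trend exactly, so that the estimated seasonal component coincides with the pattern used to generate the data.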

Model  with  deterministic  seasonals 

Now we discuss some parametric models for seasonal time series. As in the
case of trends, one should distinguish deterministic from stochastic
seasonals. For simplicity we consider again the case of quarterly data with
additive seasonals. The results can be generalized for other observation
frequencies (for instance, for monthly or weekly data) and for series with
multiplicative seasonals.

Deterministic seasonals can be modelled by seasonal dummies. For instance,
an AR(1) model for quarterly data with deterministic trend and
seasonal components is given by

$$y_t = \alpha + \beta t + \alpha_2 D_{2t} + \alpha_3 D_{3t} + \alpha_4 D_{4t} + \phi y_{t-1} + \varepsilon_t.$$

Here $D_{2t}$ is a dummy variable with $D_{2t} = 1$ if the $t$th observation falls
in the second quarter of a year and $D_{2t} = 0$ otherwise, and the dummies
$D_{3t}$ and $D_{4t}$ for the third and fourth quarters are defined in a similar
way. If $-1 < \phi < 1$, then this model can be estimated and evaluated by
conventional OLS methods, and more general ARMA models can be estimated as usual
by ML. If the trend is stochastic, that is, if $\phi = 1$, then the parameters
are easily estimated by regressing $\Delta y_t$ on a constant, the trend $t$, and
the three seasonal dummies. The Dickey-Fuller critical values also apply for a
unit root test (that is, a test of whether $\phi = 1$) in this model with
deterministic seasonal dummies.
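The construction of the seasonal dummies and the OLS estimation of this model can be sketched as follows. The function names, the parameter values, and the start value are illustrative assumptions; the artificial series is generated without error term, so that OLS should recover the parameters up to rounding.

```python
import numpy as np

def seasonal_dummies(n, first_quarter=1):
    """n x 3 matrix with columns D2, D3, D4 for n consecutive quarterly observations."""
    quarter = (np.arange(n) + first_quarter - 1) % 4 + 1
    return np.column_stack([(quarter == j).astype(float) for j in (2, 3, 4)])

def fit_seasonal_ar1(y):
    """OLS estimates [alpha, beta, alpha2, alpha3, alpha4, phi]; y[0] is the start value."""
    y = np.asarray(y, dtype=float)
    n = len(y) - 1
    t = np.arange(1, n + 1, dtype=float)
    X = np.column_stack([np.ones(n), t, seasonal_dummies(n + 1)[1:], y[:-1]])
    coef, *_ = np.linalg.lstsq(X, y[1:], rcond=None)
    return coef

# Artificial noise-free series generated from the model with illustrative parameter values.
alpha, beta, a2, a3, a4, phi = 1.0, 0.1, 0.5, -0.3, 0.8, 0.5
D = seasonal_dummies(41)
y = np.empty(41)
y[0] = 10.0
for i in range(1, 41):
    y[i] = alpha + beta * i + D[i] @ np.array([a2, a3, a4]) + phi * y[i - 1]
estimates = fit_seasonal_ar1(y)
print(np.round(estimates, 4))
```

With observed data the same design matrix can be used; the estimates are then subject to sampling error, and the t-value of $\phi - 1$ has to be compared with Dickey-Fuller critical values.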

Model  with  stochastic  seasonals 

The simplest model with stochastic seasonals is given by the seasonal random
walk model

$$y_t = \alpha + y_{t-4} + \varepsilon_t.$$

This can be written as $(1 - L^4) y_t = \alpha + \varepsilon_t$. The AR polynomial
$(1 - z^4)$ has a unit root at $z = 1$ and three so-called seasonal unit roots at
$z = -1$ and at $z = \pm i$. As $1 - z^4 = (1 - z)(1 + z + z^2 + z^3)$, the series
of fourth differences can be written as $(1 - L^4) y_t = (1 - L) x_t$, where
$x_t = (1 + L + L^2 + L^3) y_t$ is the year-total of the series $y_t$ over the
last four quarters. So a model for $y_t$ with stochastic seasonality implies that
the series $x_t$ of year-totals has a unit root. This can be tested, for
instance, by applying the ADF test on the series $x_t$ of year-totals.
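Both facts used above are easy to check numerically. The sketch below (with a hypothetical helper `year_totals` and an arbitrary illustrative series) verifies that the four stated values are roots of $1 - z^4$ and that the fourth differences of a series equal the first differences of its year-totals.

```python
def year_totals(y):
    """x_t = y_t + y_{t-1} + y_{t-2} + y_{t-3}, available from the fourth observation on."""
    return [sum(y[t - 3:t + 1]) for t in range(3, len(y))]

# The unit root z = 1 and the seasonal unit roots z = -1, i, -i of the polynomial 1 - z^4.
roots = [1, -1, 1j, -1j]
residuals = [abs(1 - z ** 4) for z in roots]

# Check (1 - L^4) y_t = (1 - L) x_t on an arbitrary quarterly series.
y = [3.0, 1.5, 4.0, 2.5, 3.5, 2.0, 4.5, 3.0, 4.0, 2.5]
x = year_totals(y)
fourth_diff = [y[t] - y[t - 4] for t in range(4, len(y))]
first_diff_totals = [x[i] - x[i - 1] for i in range(1, len(x))]
print(fourth_diff)
print(first_diff_totals)
```

The two printed lists coincide, which is the identity that allows the seasonal random walk to be tested by a unit root test on the year-totals.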

Seasonal  ARIMA  models  and  the  ‘airline’  model 

The above model with stochastic seasonal components can be generalized to
the class of so-called seasonal ARIMA or SARIMA models. For quarterly
data, a SARIMA($p$, $d$, $q$) model is defined by

$$\phi(L^4)(1 - L^4)^d y_t = \theta(L^4)\varepsilon_t,$$

where $\phi(z)$ and $\theta(z)$ have all their roots outside the unit circle. This
is an ARIMA model where the value of $y_t$ is related only to the values
$y_{t-4}, y_{t-8}, y_{t-12}$ (and so on) of observations in the same quarter in
former years. This is motivated by the fact that seasonal time series often
exhibit the strongest correlations at the seasonal lags. If non-seasonal
correlations are also of importance, the seasonal model can be combined with an
ARIMA model. For instance, the so-called 'airline model' is given by

$$(1 - L)(1 - L^4) y_t = (1 + \theta_1 L)(1 + \theta_4 L^4)\varepsilon_t.$$

The right-hand side is an MA(5) process with parameter restrictions, as
only lags 1, 4, and 5 are present. The advantage of this and other seasonal
ARIMA models is that they contain relatively few parameters to model
correlations over longer lags. Note that the above model contains a double
root at unity, that is, the process is assumed to be integrated of order 2.
This makes sense, for instance, if the series $(1 - L) y_t$ contains strong
seasonal correlations that die out only very slowly. An alternative is to
consider models with deterministic trends and seasonals or combinations of
stochastic trends with deterministic seasonals.
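The parameter restrictions of the airline model become visible when the lag polynomials are multiplied out. The sketch below does this for illustrative values of $\theta_1$ and $\theta_4$ (assumptions, not estimates); the helper `poly_multiply` is hypothetical and stores the coefficient of $L^k$ at index $k$.

```python
def poly_multiply(a, b):
    """Coefficients of the product of two lag polynomials (index = power of L)."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

theta1, theta4 = -0.4, -0.6      # illustrative MA parameters

# Right-hand side: (1 + theta1 L)(1 + theta4 L^4) = 1 + theta1 L + theta4 L^4 + theta1 theta4 L^5
ma = poly_multiply([1.0, theta1], [1.0, 0.0, 0.0, 0.0, theta4])

# Left-hand side: (1 - L)(1 - L^4) = 1 - L - L^4 + L^5
ar = poly_multiply([1.0, -1.0], [1.0, 0.0, 0.0, 0.0, -1.0])

print(ma)
print(ar)
```

The MA polynomial has degree 5 but only two free parameters: the coefficients at lags 2 and 3 are zero and the coefficient at lag 5 is the product $\theta_1 \theta_4$.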


Choice between deterministic and stochastic trends and seasonals

To choose among the various possible models, it is helpful to start by testing
for the order of integration of the series $y_t$. In practice the order of
integration does not exceed two, and two unit roots may be present in the case
of a stochastic trend combined with a stochastic seasonal. However, in most cases
the order of integration is either one or zero. We can first test the null
hypothesis of integration of order 2 against the alternative of order 1. This
can be tested by an ADF test on the series of first differences $(1 - L) y_t$. If
the series contains two unit roots, this suggests incorporating a stochastic
trend and a stochastic seasonal in the model. If second order integration is
rejected, we can test the null hypothesis of first order integration against the
alternative of (trend) stationarity. Deterministic components for trend and
seasonals can be included in the test equation. If the series is integrated of
order 1, then we can include either a stochastic trend with deterministic
seasonals in the model, or a stochastic seasonal with a deterministic trend. If
the series is trend stationary, we can include deterministic trend and seasonal
terms in the model.

Note that, if the series is integrated of order 1, the transformation
$x_t = (1 - L)(1 - L^4) y_t$ involves over-differencing. This means that the
ARMA model for $x_t$ is not invertible (see Exercise 7.13). An indication of
possible over-differencing can be obtained from the SACF of $x_t$, as the
theoretical autocorrelations sum up to $-\frac{1}{2}$ in this case (see Exercise
7.13).
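The sum of $-\frac{1}{2}$ can be illustrated for simple cases. The sketch below assumes that $y_t$ is a random walk, so that $x_t = (1 - L^4)\varepsilon_t$, and, as a second case, that $\Delta y_t$ is an MA(1) process with an illustrative coefficient of 0.5; the helper function is hypothetical and computes theoretical autocorrelations of a finite MA process from its coefficients.

```python
def ma_autocorrelations(c):
    """Theoretical autocorrelations rho_1, ..., rho_q of x_t = sum_j c[j] * e_{t-j}."""
    q = len(c) - 1
    gamma = [sum(c[j] * c[j + k] for j in range(q + 1 - k)) for k in range(q + 1)]
    return [g / gamma[0] for g in gamma[1:]]

# Case 1: y_t is a random walk, so x_t = (1 - L)(1 - L^4) y_t = (1 - L^4) e_t.
rho_rw = ma_autocorrelations([1.0, 0.0, 0.0, 0.0, -1.0])

# Case 2: (1 - L) y_t = (1 + 0.5 L) e_t, so x_t = (1 - L^4)(1 + 0.5 L) e_t.
rho_ma = ma_autocorrelations([1.0, 0.5, 0.0, 0.0, -1.0, -0.5])

print(sum(rho_rw), sum(rho_ma))
```

In both cases the MA polynomial of $x_t$ has a root at unity, so that its coefficients sum to zero; this is what forces the autocorrelations to sum to $-\frac{1}{2}$.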

Example  7.16:  Industrial  Production  (continued) 

We continue our analysis of the quarterly series $y_t$ of logarithms of US
industrial production. We will discuss (i) the order of integration of this
series, (ii) the nature of the seasonal component of this series, (iii) remarks
on two alternative models, and (iv) the time series decomposition obtained
by the Census X-12 method.

(i)  Order  of  integration 

In  Example  7.14  in  the  foregoing  section  we  concluded  that  the  time  series 
has  a  unit  root  (see  Exhibit  7.17).  Now  we  investigate  whether  the  series  has 
two unit roots, that is, we test whether the series $\Delta y_t$ of quarterly growth
rates has a unit root. For this purpose we apply the ADF test on the series
$z_t = \Delta y_t$. Four lags are included to account for possible seasonal effects. As
the series $z_t$ does not have a clear trend direction, we exclude the deterministic
trend component from the ADF test equation. The result is shown in Panel
1 of Exhibit 7.19. The relevant t-value is -5.67, which is much below the 5
per cent critical value of -2.88 (note that the test equation does not contain
the deterministic trend term). This means that $z_t$ does not have a unit root. So
we conclude that the series $y_t$ is integrated of order 1.
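
The test equation in part (i) can be sketched in a few lines of code. The example below is an illustration on simulated data, not on the industrial production series: it computes the ADF t-statistic by ordinary least squares for a test equation with a constant, no trend, and four lagged differences. The function name `adf_tstat` and the simulated random walk are assumptions made for this sketch; the critical value -2.88 is the one quoted in the text and applies only approximately for other sample sizes.

```python
import numpy as np

def adf_tstat(z, lags=4):
    """t-statistic of rho in: dz_t = c + rho*z_{t-1} + sum_i g_i*dz_{t-i} + e_t."""
    z = np.asarray(z, dtype=float)
    dz = np.diff(z)                                  # dz[k] = z[k+1] - z[k]
    rows = range(lags, len(dz))
    # Each row: constant, lagged level, and `lags` lagged differences.
    X = np.array([[1.0, z[k]] + [dz[k - i] for i in range(1, lags + 1)]
                  for k in rows])
    y = dz[lags:]
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (len(y) - X.shape[1])       # residual variance
    cov = s2 * np.linalg.inv(X.T @ X)                # OLS covariance matrix
    return beta[1] / np.sqrt(cov[1, 1])

rng = np.random.default_rng(0)
y_sim = np.cumsum(rng.standard_normal(500))          # simulated I(1) series
t_levels = adf_tstat(y_sim)                          # test for a unit root in y
t_diff = adf_tstat(np.diff(y_sim))                   # test for a second unit root
print(t_levels, t_diff)
```

For a series that is integrated of order 1, the statistic for the first differences is expected to lie far below -2.88, in line with the outcome reported for $z_t = \Delta y_t$ in Panel 1.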

(ii)  Nature  of  the  seasonal  component 

Panel 2 of Exhibit 7.19 shows the correlogram of the series $\Delta y_t$. The SACF
indicates the presence of seasonal effects. Panels 3 and 4 contain ADF tests
for the series of year-totals $z_t = (1 + L + L^2 + L^3) y_t$. The null hypothesis of
integration of order 2 is rejected (see Panel 4, the t-value is -3.89 < -2.89),
and the null hypothesis of integration of order 1 is not rejected (see Panel 3,
the t-value is -2.49 > -3.44). We conclude that the series $z_t$ is integrated of
order 1. So the series $(1 - L) z_t = (1 - L^4) y_t = \Delta_4 y_t$ is stationary. This result,
together with the seasonal correlation that is present in the series $\Delta y_t$,
motivates using a model with stochastic seasonal for the series $y_t$, and hence
our analysis of the quarterly series of annual growth rates $\Delta_4 y_t$
in foregoing sections.
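
The identity $(1 - L)(1 + L + L^2 + L^3) = 1 - L^4$ used above can be checked numerically. The sketch below multiplies the two lag polynomials and verifies on an artificial series (an assumption of this example, not the production data) that the first difference of the year-totals equals the annual difference.

```python
import numpy as np

year_sum = np.array([1, 1, 1, 1])                 # 1 + L + L^2 + L^3
first_diff = np.array([1, -1])                    # 1 - L
product = np.convolve(first_diff, year_sum)       # coefficients in powers of L

rng = np.random.default_rng(1)
y = np.cumsum(rng.standard_normal(40))            # artificial series
z = y[3:] + y[2:-1] + y[1:-2] + y[:-3]            # z_t = y_t + ... + y_{t-3}
lhs = np.diff(z)                                  # (1 - L) z_t
rhs = y[4:] - y[:-4]                              # (1 - L^4) y_t
print(product, np.allclose(lhs, rhs))
```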

The series $\Delta_4 y_t$ contains no trend and also no seasonal effects, as is clear
from the SACF in Exhibit 7.8 (see Example 7.7). This means that the trend
and the seasonal are both eliminated by transforming the series $y_t$ to the
series of annual growth rates $\Delta_4 y_t = (1 - L^4) y_t$. Models for the stationary
series $\Delta_4 y_t$ were discussed in Example 7.11.

(iii)  Remarks  on  two  alternative  models 

It  is  left  as  an  exercise  (see  Exercise  7.16)  to  estimate  two  alternative 
models, namely, an AR(p) model with deterministic seasonals for the series
$\Delta y_t$ and an 'airline' model for the series $y_t$. The models can be compared by
diagnostic  tests  and  by  comparing  their  forecast  performance,  as  described  in 
Section  7.2.4. 

(iv)  Time  series  decomposition  obtained  by  the  Census  X-12  method 

Finally  we  describe  the  results  of  the  Census  X-12  method.  We  consider  an 
additive  model  for  the  series  yt  of  logarithms  of  the  industrial  production. 
This  corresponds  to  a  multiplicative  model  for  the  original  production  index 
series.  The  seasonally  adjusted  series  for  the  period  1980-94  is  shown  in 
Exhibit  7.20  (a).  The  estimated  seasonal  components  are  relatively  small,  as 
they are all below 0.01, that is, they are less than 1 per cent. As a consequence,
the seasonally adjusted series is very close to the original one. The
estimated  seasonal  components  in  Panel  2  of  Exhibit  7.20  indicate  some 
Panel 1: ADF Test Statistic  -5.668179     5% Critical Value  -2.8827
Augmented Dickey-Fuller Test Equation; Dependent Variable: D(DY)
Sample: 1961:1 1994:4; Included observations: 136

Variable        Coefficient   Std. Error   t-Statistic
DY(-1)           -0.786100     0.138687     -5.668179
D(DY(-1))         0.070365     0.134332      0.523814
D(DY(-2))         0.160678     0.114100      1.408224
D(DY(-3))        -0.020794     0.099616     -0.208744
D(DY(-4))         0.322013     0.079909      4.029740
C                 0.006605     0.001925      3.431268

Panel 2: Correlogram of DY

Lag     SACF     SPACF    Q-Stat    Prob
 1      0.112    0.112    1.7472    0.186
 2      0.212    0.202    8.0437    0.018
 3     -0.098   -0.147    9.3999    0.024
 4      0.325    0.332    24.430    0.000
 5     -0.269   -0.372    34.775    0.000
 6      0.059    0.082    35.272    0.000
 7     -0.258   -0.159    44.928    0.000
 8      0.091   -0.009    46.135    0.000
 9     -0.295   -0.076    58.988    0.000
10      0.124    0.061    61.269    0.000
11     -0.193   -0.059    66.869    0.000
12      0.163    0.057    70.889    0.000

Panel 3: ADF Test Statistic  -2.486805     5% Critical Value  -3.4445
Augmented Dickey-Fuller Test Equation; Dep. Var.: D(YEARSUMY)
Sample(adjusted): 1962:2 1994:4; Included observations: 131

Variable            Coefficient   Std. Error   t-Statistic
YEARSUMY(-1)         -0.016469     0.006622     -2.486805
D(YEARSUMY(-1))       1.306777     0.087579     14.92109
D(YEARSUMY(-2))      -0.594846     0.145276     -4.094604
D(YEARSUMY(-3))       0.183889     0.144221      1.275052
D(YEARSUMY(-4))      -0.136339     0.085423     -1.596048
C                     0.260355     0.099926      2.605466
TREND                 0.000404     0.000192      2.102449

Panel 4: ADF Test Statistic  -3.890236     5% Critical Value  -2.88371
Augmented Dickey-Fuller Test Equation; Dep. Var.: D(DYEARSUMY)
Sample(adjusted): 1962:3 1994:4; Included observations: 130

Variable             Coefficient   Std. Error   t-Statistic
DYEARSUMY(-1)         -0.179666     0.046184     -3.890236
D(DYEARSUMY(-1))       0.572717     0.078698      7.277444
D(DYEARSUMY(-2))      -0.080749     0.093386     -0.864676
D(DYEARSUMY(-3))       0.239245     0.090181      2.652941
D(DYEARSUMY(-4))      -0.203484     0.084778     -2.400198
C                      0.005857     0.002188      2.676967

Exhibit 7.19 Industrial Production (Example 7.16)

ADF test on the series of first differences $\Delta y_t$ (denoted by DY in Panel 1; here $H_0$: $y_t$ is I(2) is
tested against $H_1$: $y_t$ is I(1)), correlogram of $\Delta y_t$ (Panel 2), ADF test on the quarterly series of
year-totals (Panel 3; here $H_0$: YEARSUMY is I(1) is tested against $H_1$: YEARSUMY is I(0)),
and ADF test on the first differences DYEARSUMY (Panel 4; here $H_0$: YEARSUMY is I(2) is
tested against $H_1$: YEARSUMY is I(1)).
(a) [Time plot, 1980-94: the series Y and the seasonally adjusted series YX12]

Panel 2: Estimated seasonal components

Year    Quarter 1   Quarter 2   Quarter 3   Quarter 4
1980      0.002      -0.005       0.009      -0.005
1981      0.001      -0.005       0.009      -0.005
1982      0.001      -0.005       0.009      -0.005
1983      0.000      -0.004       0.008      -0.004
1984      0.000      -0.004       0.008      -0.003
1985      0.000      -0.004       0.007      -0.003
1986      0.000      -0.004       0.007      -0.002
1987     -0.001      -0.004       0.007      -0.002
1988     -0.002      -0.004       0.008      -0.002
1989     -0.004      -0.003       0.010      -0.002
1990     -0.005      -0.003       0.011      -0.003
1991     -0.005      -0.002       0.011      -0.003
1992     -0.006      -0.002       0.011      -0.003
1993     -0.005      -0.002       0.010      -0.003
1994     -0.005      -0.001       0.009      -0.003

Panel 3: Predicted Seasonal Components one year ahead (1995)

Quarter                            1        2        3        4     Average
Predicted Seasonal              -0.005   -0.001    0.009   -0.003      0
Actual Production (in logs)      4.727    4.731    4.756    4.746    4.740
Actual Production (level)        112.9    113.4    116.3    115.2    114.5
(Actual - Average) / Average    -0.014   -0.009    0.016    0.007      0

Exhibit 7.20 Industrial Production (Example 7.16)

Seasonally adjusted series obtained by Census X-12 for the logarithmic series y of US industrial
production ((a), 1980-94), estimated additive seasonal components (Panel 2), and forecasts
for 1995 together with the actual production (Panel 3).


trends  in  the  seasonal  components.  For  all  years,  the  effect  of  the  third 
quarter  is  positive  and  by  far  the  largest.  The  effects  of  the  second  and  fourth 
quarters  are  slightly  growing  over  time,  and  the  first  quarter  is  falling  back. 

Panel 3 of Exhibit 7.20 shows the forecasts for the four quarters in
1995 for the series $y_t$. As the series consists of the logarithms of industrial
production, we conclude that the production level is predicted to be around
1.5 per cent higher (0.009 - (-0.005) = 0.014) in the third quarter than in the first
quarter of 1995. The actual production indices for 1995 are also given in
Panel 3 of Exhibit 7.20, which shows that the production in the third quarter
of 1995 actually was around 3 per cent higher (0.016 - (-0.014) = 0.030) than that in
the first quarter. Such deviations from previous patterns may be of interest to
predict further future developments in industrial production.
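
The percentage comparisons between the third and the first quarter of 1995 can be reproduced from the numbers in Panel 3 of Exhibit 7.20. The sketch below is only an arithmetic illustration; the variable names are chosen for this example.

```python
import math

pred_seasonal = {1: -0.005, 2: -0.001, 3: 0.009, 4: -0.003}   # Panel 3, predicted
actual_level = {1: 112.9, 2: 113.4, 3: 116.3, 4: 115.2}       # Panel 3, actual index

# A difference of logarithms approximates a relative difference in levels.
pred_log_diff = pred_seasonal[3] - pred_seasonal[1]
pred_pct = 100 * (math.exp(pred_log_diff) - 1)

actual_log_diff = math.log(actual_level[3] / actual_level[1])
actual_pct = 100 * (actual_level[3] / actual_level[1] - 1)

print(round(pred_log_diff, 3), round(pred_pct, 2))
print(round(actual_log_diff, 3), round(actual_pct, 2))
```

The predicted log difference of 0.014 corresponds to a level difference of about 1.4 per cent, whereas the realized difference between the third and first quarter is about 3 per cent.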

Exercises: T: 7.2d, 7.5d; S: 7.13c, d; E: 7.16a-e, 7.17e-g, 7.18b, f, g.


7.3.5  Summary 

• Many time series in business and economics are characterized by trends
and seasonal fluctuations.

•  To  predict  future  developments  one  should  make  sure  that  the  trend  and 
the  seasonal  effects  in  the  time  series  are  modelled  in  an  appropriate 
way. 

•  A  time  series  with  deterministic  trend  has  the  following  properties.  The 
time  series  reverts  to  the  trend  line  in  the  long  run,  and  the  effect  of 
shocks  dies  out  when  time  progresses.  Forecast  intervals  are  equally 
wide  for  all  forecast  horizons. 

•  A  time  series  with  stochastic  trend  is  not  trend  reverting,  shocks  have  a 
permanent  effect  on  the  level  of  the  series,  the  variance  increases  over 
time,  and  forecast  intervals  become  wider  for  larger  horizons. 

• The (augmented) Dickey-Fuller test can be used to test the null hypothesis
of a stochastic trend against the alternative hypothesis of a deterministic
trend. The corresponding (F- and t-) tests do not have a
standard distribution. The appropriate distribution depends on the
inclusion of constant term and trend term in the test equation.

• Trends and seasonals can be estimated and predicted by parametric
models with deterministic or stochastic trends and seasonal components.
It is also possible to use other methods, for instance, exponential
smoothing (EWMA), Holt-Winters, and Census X-12.
7.4 Non-linearities and time-varying volatility

Uses Chapters 1-4; Section 5.5; parts of Sections 5.3, 5.4; Sections 7.1-7.3.


7.4.1  Outliers 

Additive  outliers 

Trends  and  seasonals  are  the  most  dominant  features  of  most  time  series  in 
business  and  economics.  In  addition  the  series  may  have  other  striking 
features.  For  example,  in  Example  7.1  we  considered  the  quarterly  series  of 
industrial  production,  and  Exhibit  7.1  shows  that  the  observations  in  1975.1 
and  1975.2  correspond  with  an  excessive  slump  in  US  industrial  production. 
Given  the  general  pattern  of  the  data  before  1975,  it  seems  unlikely  that  one 
could construct an ARMA model with trend and seasonals that could forecast
these exceptionally large negative growth rates. Such observations are
outliers.  If  one  does  not  correct  for  such  outliers,  they  may  have  an  excessive 
impact  on  parameter  estimates  and  forecasts. 

One can distinguish different types of outliers in time series. Suppose
that

$$ y_t = z_t + \delta D_t(\tau), $$

where $z_t$ is a stationary but unobserved time series and $D_t(\tau)$ is a dummy
variable with $D_\tau(\tau) = 1$ and $D_t(\tau) = 0$ for $t \neq \tau$. Then the observation $y_\tau$ (at
time $t = \tau$) is called an additive outlier and $\delta$ is the size of the outlier. An
additive outlier affects the measured value of the time series at one specific
point in time ($t = \tau$), but there are no effects on the observations afterwards.
If additive outliers are neglected in modelling, then this may have serious
effects. For instance, suppose that $z_t$ follows a stationary AR(p) process.
Then the additive outlier at time $t = \tau$ affects all forecasted values of $y_t$ at
times $t = \tau + 1, \ldots, \tau + p$. So not only will the model produce a bad forecast
for the observation $y_\tau$, but also the forecasts of the observations $y_{\tau+j}$ will be
affected for $j = 1, \ldots, p$. So this may produce a sequence of comparatively
large residuals. As a consequence, an additive outlier affects the estimated
parameters (that are obtained by minimizing the squared residuals) and the
quality of a sequence of forecasts. Additive outliers may also affect unit root
tests, in the sense that the null hypothesis of a unit root is too easily rejected
(see Exercise 7.14).

Test  for  additive  outliers 

If the time instant $\tau$ of the potential outlier is known, one can apply a t-test
for the significance of $\delta$. For instance, suppose that $z_t$ follows a stationary
AR(p) process, so that $\phi(L) z_t = \alpha + \varepsilon_t$. If we substitute $z_t = y_t - \delta D_t(\tau)$, it
follows that the relevant test equation is

$$ y_t = \alpha + \sum_{j=1}^{p} \phi_j y_{t-j} + \delta D_t(\tau) - \delta \sum_{j=1}^{p} \phi_j D_{t-j}(\tau) + \varepsilon_t. $$

Here $D_{t-j}(\tau)$ is a dummy variable that has the value one at time
$t = \tau + j$ and zero otherwise. This is a regression model with $(2p + 2)$
regressors $(1, y_{t-1}, \ldots, y_{t-p}, D_t, D_{t-1}, \ldots, D_{t-p})$ and $(p + 2)$ parameters
$(\alpha, \delta, \phi_1, \ldots, \phi_p)$. The parameters can be estimated by non-linear least
squares. If $\delta$ is significant, this indicates the possible presence of an additive
outlier. One can also test whether the outlier is of this specific additive type.
For this purpose estimate the unrestricted model with $(2p + 2)$ parameters,
and test the $p$ parameter restrictions corresponding to the additive outlier
model.
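
The non-linear least squares step can be illustrated for the simplest case $p = 1$. The sketch below is not the estimation routine used in the examples of this chapter; it assumes an AR(1) process, simulated data, and a known outlier date. It uses the fact that, for a fixed value of the autoregressive parameter, the test equation is linear in the constant and the outlier size, so that the remaining parameter can be found by a grid search. The function name `fit_additive_outlier_ar1` and all parameter values are hypothetical.

```python
import numpy as np

def fit_additive_outlier_ar1(y, tau, grid=np.linspace(-0.95, 0.95, 381)):
    """Concentrated NLS for y_t = a + phi*y_{t-1} + delta*(D_t - phi*D_{t-1}) + e_t."""
    y = np.asarray(y, dtype=float)
    d = np.zeros_like(y)
    d[tau] = 1.0                                   # impulse dummy D_t(tau)
    best = None
    for phi in grid:
        lhs = y[1:] - phi * y[:-1]                 # left-hand side for given phi
        X = np.column_stack([np.ones(len(lhs)), d[1:] - phi * d[:-1]])
        coef, *_ = np.linalg.lstsq(X, lhs, rcond=None)
        ssr = float(np.sum((lhs - X @ coef) ** 2))
        if best is None or ssr < best[0]:
            best = (ssr, coef[0], coef[1], phi)
    _, alpha, delta, phi = best
    return alpha, delta, phi

rng = np.random.default_rng(42)
n, tau, phi_true, delta_true = 300, 150, 0.6, 10.0
z = np.zeros(n)
for t in range(1, n):
    z[t] = phi_true * z[t - 1] + 0.5 * rng.standard_normal()
y = z.copy()
y[tau] += delta_true                               # additive outlier at t = tau
alpha_hat, delta_hat, phi_hat = fit_additive_outlier_ar1(y, tau)
print(alpha_hat, delta_hat, phi_hat)
```

The grid search is chosen here only to keep the example free of optimisation libraries; a general-purpose non-linear least squares routine would serve the same purpose.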

Innovation  outliers 

Another type of outlier is an innovation outlier where the outlier occurs in
the innovation process. An innovation outlier at time $\tau$ in an ARMA model is
given by

$$ \phi(L) y_t = \theta(L)(\varepsilon_t + \delta D_t(\tau)). $$

For instance, for an AR(p) model this gives

$$ y_t = \phi_1 y_{t-1} + \cdots + \phi_p y_{t-p} + \delta D_t(\tau) + \varepsilon_t. $$

This shows that the forecast of $y_t$ is affected only at time $t = \tau$. So an
innovation outlier will lead to a single large residual and will in general
have less severe effects on parameter estimates than an additive outlier. An
innovation outlier may affect unit root tests (see Exercise 7.14). If the time $\tau$
of a possible innovation outlier is known, one can apply a simple t-test on the
significance of $\delta$. One should check whether neighbouring residuals are not
also outliers, as otherwise an additive outlier may be more appropriate.
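
The different residual patterns of additive and innovation outliers can be made visible in a small simulation. The sketch below assumes an AR(1) process with known parameter, so that one-step-ahead residuals can be computed without estimation; all numerical values are hypothetical and chosen only to make the pattern clear.

```python
import numpy as np

rng = np.random.default_rng(7)
n, tau, phi, delta = 200, 100, 0.8, 20.0
eps = rng.standard_normal(n)

def simulate_ar1(shocks):
    y = np.zeros(len(shocks))
    for t in range(1, len(shocks)):
        y[t] = phi * y[t - 1] + shocks[t]
    return y

y_ao = simulate_ar1(eps)
y_ao[tau] += delta                       # additive: only the measured value at tau shifts
shocks_io = eps.copy()
shocks_io[tau] += delta                  # innovation: the shock itself is contaminated
y_io = simulate_ar1(shocks_io)

def one_step_residuals(y):
    return y[1:] - phi * y[:-1]          # residuals for t = 1, ..., n-1 using the true phi

# Dates with residuals larger than six standard deviations of the innovations.
large_ao = np.flatnonzero(np.abs(one_step_residuals(y_ao)) > 6) + 1
large_io = np.flatnonzero(np.abs(one_step_residuals(y_io)) > 6) + 1
print(large_ao, large_io)
```

With $p = 1$ the additive outlier produces large residuals at the outlier date and at the next date, whereas the innovation outlier produces a single large residual.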
Level shifts

If the series contains a stochastic trend, then an innovation outlier has a
permanent effect on the level of the series (see Exercise 7.14). For a stationary
series, a level shift at time $t = \tau$ can be modelled by

$$ \phi(L) y_t = \alpha + \delta D_t^{+}(\tau) + \theta(L) \varepsilon_t. $$

Here $D_t^{+}$ is a dummy variable with $D_t^{+}(\tau) = 0$ for $t < \tau$ and $D_t^{+}(\tau) = 1$ for
$t \geq \tau$. The mean of the series $y_t$ shifts from $\alpha/\phi(1)$ prior to the level shift to
$(\alpha + \delta)/\phi(1)$ afterwards. Neglecting such a shift will lead to a sequence of
large residuals around $t = \tau$. If the time $\tau$ of the potential shift is known,
the presence of a level shift can simply be tested by the t-test on $\delta$. In some
cases the shift is more gradual and extends over several time periods. This can
be modelled by replacing the step function $D_t^{+}(\tau)$ by a more smooth function
of time.
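
The shift in the mean can be checked with a noise-free recursion. The sketch below assumes an AR(1) model, for which the value of the AR polynomial at one is one minus the autoregressive parameter; the parameter values and the function name `ar1_path` are hypothetical.

```python
def ar1_path(alpha, phi, delta, tau, n, y0):
    """Noise-free path y_t = alpha + delta*D_t^+ + phi*y_{t-1}, with D_t^+ = 1 for t >= tau."""
    y = [y0]
    for t in range(1, n):
        step = 1.0 if t >= tau else 0.0
        y.append(alpha + delta * step + phi * y[-1])
    return y

alpha, phi, delta = 1.0, 0.5, 2.0
mean_before = alpha / (1 - phi)             # alpha / phi(1)
mean_after = (alpha + delta) / (1 - phi)    # (alpha + delta) / phi(1)
path = ar1_path(alpha, phi, delta, tau=50, n=100, y0=mean_before)
print(mean_before, mean_after, path[49], path[99])
```

The path stays at the old mean up to the break date and then converges geometrically to the new mean, which illustrates why neglecting the shift produces a run of large residuals around the break.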

Diagnostic  checks 

In  practice  one  can  detect  possible  outliers  and  breaks  by  considering  plots  of 
the  observed  time  series  yt.  It  may  also  be  instructive  to  estimate  simple 
models  and  to  inspect  the  time  series  plot  and  the  histogram  of  the  resulting 
series  of  residuals.  Statistical  tests  can  be  applied  by  including  appropriate 
dummy  variables  in  the  model  and  by  testing  their  significance.  If  the  outliers 
cannot  be  modelled  in  an  acceptable  way,  then  a  possible  alternative  is  to  use 
robust  estimation  methods  that  assign  less  weight  to  outliers.  This  was 
discussed  in  Section  5.6.4  (p.  390). 

Example  7.17:  Industrial  Production  (continued) 

We consider again the quarterly series $\Delta_4 y_t$ of yearly growth rates of US industrial
production. We will discuss (i) a graphical inspection of outliers, (ii) a
model  with  dummies  for  the  outliers,  and  (iii)  diagnostic  tests  for  this  model. 

(i)  Graphical  inspection  of  outliers 

Exhibit 7.1 (c) in Example 7.1 shows a time plot of the series $\Delta_4 y_t$ of yearly
growth  rates.  This  exhibit  shows  that  the  series  quite  often  takes  values  that 
are  far  apart  from  the  majority  of  observed  growth  rates.  In  Example  7.8  we 
estimated  an  AR(2)  model  for  this  series  (see  Exhibit  7.9),  and  in  Example 
7.11  we  applied  diagnostic  tests  (see  Exhibit  7.11).  The  residuals  of  this 
model  are  not  normally  distributed.  The  time  plot  of  the  residuals  is  given 
once  more  in  Exhibit  7.21  (a),  together  with  the  95  per  cent  confidence 
bounds.  This  shows  that  there  may  be  outliers  in  the  quarters  1961.1, 
1961.2,  1974.4,  1975.1,  1976.1,  1980.2,  and  1981.4. 
(a) [Time plot of the residuals of the AR(2) model without dummies]

Panel 2: Dependent Variable: D4Y
Method: Least Squares
Sample: 1961:1 1994:4; Included observations: 136

Variable     Coefficient   Std. Error   t-Statistic   Prob.
C              0.007652     0.001687      4.536416    0.0000
D4Y(-1)        1.318387     0.057799     22.80998     0.0000
D4Y(-2)       -0.515566     0.057037     -9.039201    0.0000
DUM611        -0.068627     0.015648     -4.385536    0.0000
DUM612         0.081021     0.016277      4.977700    0.0000
DUM744        -0.047186     0.015593     -3.026035    0.0030
DUM751        -0.054010     0.015859     -3.405648    0.0009
DUM761         0.069084     0.016130      4.282876    0.0000
DUM802        -0.052449     0.015568     -3.369090    0.0010
DUM814        -0.067683     0.015575     -4.345596    0.0000

R-squared             0.907760    Mean dependent var       0.032213
Adjusted R-squared    0.901171    S.D. dependent var       0.049219
S.E. of regression    0.015473    Akaike info criterion   -5.428742
Sum squared resid     0.030167    Schwarz criterion       -5.214576
Log likelihood        379.1545    F-statistic              137.7771
Durbin-Watson stat    2.035319    Prob(F-statistic)        0.000000

(c) [Histogram of the residuals of the AR(2) model with dummies]

Exhibit 7.21 Industrial Production (Example 7.17)

(a) shows the time plot of the residuals of the AR(2) model (without dummies) for the
series $\Delta_4 y_t$ of yearly growth rates (see also Exhibit 7.11), Panel 2 shows the results of the
AR(2) model with seven outlier dummies, and (c) contains the histogram of the residuals of
this model.
(ii) Model with outlier dummies

We include separate dummy variables for the seven possible outlier observations
in the AR(2) model for $\Delta_4 y_t$. The resulting estimates are shown in
Panel 2 of Exhibit 7.21. Each of the dummies is highly significant. The
estimated autoregressive parameters (with standard errors in parentheses)
are $\hat{\phi}_1 = 1.318$ (0.058) and $\hat{\phi}_2 = -0.516$ (0.057). If we compare these
outcomes with the ones of the original AR(2) model in Exhibit 7.9 of
Example 7.8, we see that the estimates do not change much, but that
the standard errors become considerably smaller after correction for the
outliers (the standard errors in Exhibit 7.9 are 0.072). Also the standard
error of regression reduces considerably, from s = 0.021 in the model
without dummies to s = 0.015 in the model with dummies. If the AR(2)
model with dummies is used in forecasting, then the point forecasts
of future values of the series are not much affected but forecast intervals
are narrower.
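
The mechanics of an outlier dummy in an AR(2) regression can be sketched as follows. The example uses simulated data with one artificial outlier (all numbers are hypothetical and unrelated to the estimates in Exhibit 7.21). It shows that an impulse dummy fits its own observation exactly, so that this observation no longer contributes to the residual sum of squares.

```python
import numpy as np

rng = np.random.default_rng(3)
n, tau = 120, 60
y = np.zeros(n)
for t in range(2, n):
    y[t] = 0.01 + 1.3 * y[t - 1] - 0.5 * y[t - 2] + 0.02 * rng.standard_normal()
y[tau] -= 0.3                                    # artificial outlier at t = tau

target = y[2:]                                   # y_t for t = 2, ..., n-1
dummy = np.zeros(n - 2)
dummy[tau - 2] = 1.0                             # row tau-2 of the regression holds y[tau]
X_plain = np.column_stack([np.ones(n - 2), y[1:-1], y[:-2]])
X_dummy = np.column_stack([X_plain, dummy])

def ols(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, y - X @ beta

beta_plain, resid_plain = ols(X_plain, target)
beta_dummy, resid_dummy = ols(X_dummy, target)
print(resid_plain[tau - 2], resid_dummy[tau - 2])
print(np.sum(resid_plain ** 2), np.sum(resid_dummy ** 2))
```

Because the residual at the dummy date is zero by construction, the standard error of regression falls when the dummy is added, which is the mechanism behind the narrower forecast intervals mentioned above.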

(iii)  Diagnostic  checks  on  model  with  outlier  dummies 

The  histogram  of  the  residuals  of  the  AR(2)  model  with  dummies  is  given  in 
Exhibit  7.21  (c).  The  assumption  of  a  normal  distribution  of  the  residuals  is 
not  rejected  (the  Jarque-Bera  test  has  P-value  0.64).  The  kurtosis  is  3.177, 
which  is  close  to  3,  whereas  the  kurtosis  of  the  residuals  of  the  AR(2)  model 
without  dummies  is  equal  to  5.896  (see  Panel  6  of  Exhibit  7.11  in  Example 
7.11).  This  indicates  that  the  non-normality  of  the  residuals  of  the  AR(2) 
model  may  well  be  caused  by  outliers  in  the  time  series.  It  is  left  as  an  exercise 
to  investigate  whether  some  sequences  of  successive  outliers  may  be  due  to 
additive  outliers  (see  Exercise  7.16).  Further  non-linear  aspects  of  this  time 
series  are  analysed  in  Example  7.18  in  the  next  section. 
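
The role of the kurtosis in the Jarque-Bera test can be illustrated with the two kurtosis values quoted above. The sketch below uses the usual form of the statistic and, as an explicit assumption, sets the skewness to zero because the skewness values are not reported in this section; the outcomes are therefore only the kurtosis contributions, not the full test statistics. The sample size of 136 is taken from Exhibit 7.21 and is assumed to apply to both models.

```python
def jarque_bera(n, skewness, kurtosis):
    """JB = n * (S^2 / 6 + (K - 3)^2 / 24)."""
    return n * (skewness ** 2 / 6 + (kurtosis - 3) ** 2 / 24)

n = 136                                          # observations in Exhibit 7.21
kurt_part_without = jarque_bera(n, 0.0, 5.896)   # AR(2) without dummies, kurtosis term only
kurt_part_with = jarque_bera(n, 0.0, 3.177)      # AR(2) with dummies, kurtosis term only
print(round(kurt_part_without, 2), round(kurt_part_with, 2))
```

The kurtosis term alone is far above the 5 per cent critical value 5.99 of the chi-squared distribution with two degrees of freedom for the model without dummies, and far below it for the model with dummies.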

Exercises:  S:  7.14e-h;  E:  7.16f,  7.23f. 


7.4.2  Time-varying  parameters 

Parameter  variations  in  ARMA  models 

A  sequence  of  outlying  observations  may  be  due  to  changes  in  the  ARMA 
parameters.  Such  changes  can  be  caused,  for  instance,  by  different  economic 
regimes.  For  example,  the  speed  of  dynamic  adjustments  of  an  economic 
process  during  a  recession  may  differ  from  the  speed  in  expansion  periods.  If 
expansions  are  more  common  than  recessions,  then  estimated  models  will  in 
general  perform  better  for  the  expansion  periods  and  may  produce  sequences 
of  outliers  for  recession  periods.  In  such  a  situation  the  model  should  be 
adjusted  by  allowing  the  parameters  to  vary  over  time. 


In  Section  5.3  we  discussed  methods  to  model  parameter  variations  and  to 
test  for  the  presence  of  such  variations.  These  methods  —  for  instance,  the 
CUSUM  and  Chow  tests  —  can  also  be  applied  in  stationary  ARMA  models. 
This  is  straightforward  for  AR  models,  as  these  models  have  the  same 
structure  as  regression  models.  If  MA  terms  are  present,  then  the  Chow 
tests  can  be  performed  by  incorporating  appropriate  dummy  variables 
under  the  alternative  hypothesis  of  a  break.  The  recursive  residuals  needed 
for  the  CUSUM  test  can  be  obtained  from  a  sequence  of  ML  parameter 
estimates, where the observations $y_s$ for $s < t$ are used to estimate the
model at time $(t - 1)$ to make a one-step-ahead forecast for $y_t$.


Threshold  model 

There are many ways to specify parameter variations in ARMA models. For
simplicity we consider the stationary AR(2) model to illustrate some possible
models. If the parameters have changed at a single and known time moment
($t = \tau$), then this can be modelled by

$$ y_t = \alpha_1 + \phi_{11} y_{t-1} + \phi_{21} y_{t-2} + D_t^{+}(\tau)(\alpha_2 + \phi_{12} y_{t-1} + \phi_{22} y_{t-2}) + \varepsilon_t. $$

Here $D_t^{+}$ is a dummy variable with $D_t^{+}(\tau) = 0$ for $t < \tau$ and $D_t^{+}(\tau) = 1$ for $t \geq \tau$. So
the regime switches abruptly at the time instant $t = \tau$. It may also be that the
regime is determined by past values of the process $y_t$. For instance, if a recession
corresponds to a negative value of $y_{t-1}$, then the regime may be defined by the sign
(positive or negative) of $y_{t-1}$. A possible model with regime switches is then
obtained by replacing the dummy variable $D_t^{+}(\tau)$ in the above model by the
dummy variable $D_t(y_{t-1})$ defined by

$$ D_t(y_{t-1}) = 0 \ \text{if } y_{t-1} > 0, \qquad D_t(y_{t-1}) = 1 \ \text{if } y_{t-1} < 0. $$

The resulting model is

$$ y_t = \alpha_1 + \phi_{11} y_{t-1} + \phi_{21} y_{t-2} + D_t(y_{t-1})(\alpha_2 + \phi_{12} y_{t-1} + \phi_{22} y_{t-2}) + \varepsilon_t. \qquad (7.26) $$

This is called the threshold autoregressive (TAR) model. The above model is a
TAR(2) model with switching variable $y_{t-1}$ and with threshold value zero between
the two regimes. This model can be extended to ARMA models with other
switching variables and other threshold values.
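
Because the regime dummy in (7.26) is a known function of the lagged observation, the TAR(2) model is linear in its six parameters and can be estimated by ordinary least squares. The sketch below builds the regressor matrix and applies it to a simulated series; the function name `tar2_design` and the simulation parameters are assumptions made for this illustration.

```python
import numpy as np

def tar2_design(y):
    """Regressors of (7.26): [1, y1, y2, D, D*y1, D*y2], where D = 1 if y_{t-1} < 0."""
    y = np.asarray(y, dtype=float)
    y1, y2 = y[1:-1], y[:-2]                 # y_{t-1} and y_{t-2} for t = 2, ..., n-1
    d = (y1 < 0).astype(float)               # regime dummy D_t(y_{t-1})
    X = np.column_stack([np.ones(len(y1)), y1, y2, d, d * y1, d * y2])
    return X, y[2:]

# Small hand-checkable case.
X_small, target_small = tar2_design([1.0, -1.0, 2.0, 3.0])

# Simulated two-regime series (hypothetical parameter values).
rng = np.random.default_rng(11)
y_sim = np.zeros(500)
for t in range(2, 500):
    recession = 1.0 if y_sim[t - 1] < 0 else 0.0
    y_sim[t] = (0.2 + 0.5 * y_sim[t - 1]
                + recession * (-0.4 - 0.3 * y_sim[t - 1])
                + rng.standard_normal())
X, target = tar2_design(y_sim)
beta_hat, *_ = np.linalg.lstsq(X, target, rcond=None)
print(beta_hat)
```

The estimated vector is ordered as the constant and two autoregressive parameters of the base regime, followed by the three regime-specific additions.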


Smooth  transition  model 

In the TAR model the shifts of regime are abrupt, because the switching function
$D_t(y_{t-1})$ is a discontinuous function of $y_{t-1}$. The transitions can also be modelled
more smoothly by means of a smooth transition autoregressive (STAR) model.
The STAR(2) model with switching variable $y_{t-1}$ is given by

$$ y_t = \alpha_1 + \phi_{11} y_{t-1} + \phi_{21} y_{t-2} + F(y_{t-1})(\alpha_2 + \phi_{12} y_{t-1} + \phi_{22} y_{t-2}) + \varepsilon_t. \qquad (7.27) $$

Here $F$ is a smooth switching function. For instance, the logistic STAR(2)
model with switching variable $y_{t-1}$ is defined by (7.27) with logistic switching
function

$$ F(y_{t-1}) = \frac{1}{1 + e^{-\gamma_1 (y_{t-1} - \gamma_2)}}. $$

The parameter $\gamma_1$ determines the speed of the parameter switches due to variations
in $y_{t-1}$ around $\gamma_2$. For $\gamma_1 \to \infty$ the switching function becomes very steep and the
model converges to the TAR model (7.26) (with $\gamma_2 = 0$). If $\gamma_1$ is relatively small,
then the transitions are more smooth, and if $\gamma_1 = 0$ the parameters do not change
at all. A Likelihood Ratio test for parameter variation of this type is obtained by
comparing the log-likelihood value of the non-linear model (7.27) with the
log-likelihood of the linear AR(2) model. One can also apply more general tests for
non-linearities, for instance, the RESET of Section 5.2.2 (p. 285). The advantage
of the STAR model is that it gives a clear economic interpretation of parameter
variations in terms of different economic regimes.
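
The effect of the speed parameter on the logistic switching function can be seen by evaluating the function on a small grid. The sketch below assumes the logistic form that also appears in the estimation equation of Panel 1 of Exhibit 7.22, that is, one divided by one plus the exponential of minus the speed parameter times the deviation of the switching variable from the threshold. The grid and the parameter values are hypothetical.

```python
import numpy as np

def logistic_switch(y_lag, gamma1, gamma2):
    """F(y_{t-1}) = 1 / (1 + exp(-gamma1 * (y_{t-1} - gamma2)))."""
    y_lag = np.asarray(y_lag, dtype=float)
    return 1.0 / (1.0 + np.exp(-gamma1 * (y_lag - gamma2)))

grid = np.array([-0.05, -0.01, 0.0, 0.01, 0.05])
smooth = logistic_switch(grid, gamma1=50.0, gamma2=0.0)    # gradual transition
steep = logistic_switch(grid, gamma1=5000.0, gamma2=0.0)   # close to a step function
flat = logistic_switch(grid, gamma1=0.0, gamma2=0.0)       # constant weight 0.5
print(smooth.round(3), steep.round(3), flat)
```

For a large speed parameter the function is practically zero below the threshold and one above it, which is the threshold model as limiting case; for a speed parameter of zero the weight is constant, so that the parameters do not vary.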
Example 7.18: Industrial Production (continued)

We continue our analysis of the quarterly series of yearly growth rates in US
industrial production. We will discuss (i) a smooth transition model, (ii) a
threshold model, and (iii) an interpretation of the threshold model for this
series.

(i) Smooth transition model

In Example 7.17 in the foregoing section we detected seven outliers in the
series of yearly growth rates $\Delta_4 y_t$ in US industrial production. Six of these
outliers fall in quarters after a quarter with negative growth, that is, at
times $t$ where $\Delta_4 y_{t-1} < 0$. This suggests modelling this series by means of two
different regimes to allow for different production adjustments in recession
periods. Instead of the AR(2) model considered in previous examples, we
now estimate the logistic STAR(2) model for $\Delta_4 y_t$ with switching variable
$\Delta_4 y_{t-1}$. The results are reported in Panel 1 of Exhibit 7.22. The value of
$\hat{\gamma}_1 = 2361$ is very large. As the switching variable has mean 0.03 with
standard deviation 0.05, the large value of $\hat{\gamma}_1$ leads to very fast transitions.
However, the estimates $\hat{\gamma}_1$ and $\hat{\gamma}_2$ do not differ significantly from zero.

(ii)  Threshold  model 

Since the transitions in the STAR model are very fast, this motivates the use
of a threshold (TAR) model with two regimes, expansion if the last observed
growth rate $\Delta_4 y_{t-1}$ is positive and recession if this growth rate is negative. We
model this by means of the dummy variable $D_t^{+}$, which has value 1 during
Panel 1: Dependent Variable: D4Y
Method: Least Squares
Sample: 1961:1 1994:4; Included observations: 136
Convergence achieved after 98 iterations
D4Y = C(1)+C(2)*D4Y(-1)+C(3)*D4Y(-2) + (C(4)+C(5)*D4Y(-1)+C(6)
*D4Y(-2))/(1+@EXP(-C(7)*(D4Y(-1)-C(8))))

Parameter           Coefficient   Std. Error   t-Statistic   Prob.
C(1)                 -0.006696     0.005367     -1.247760    0.2144
C(2)                  1.174642     0.138134      8.503617    0.0000
C(3)                 -0.711489     0.113235     -6.283286    0.0000
C(4)                  0.015946     0.006978      2.285306    0.0239
C(5)                  0.069069     0.172454      0.400509    0.6894
C(6)                  0.244899     0.145274      1.685779    0.0943
C(7) (= $\gamma_1$)   2360.688     10225.55      0.230862    0.8178
C(8) (= $\gamma_2$)   0.002389     0.001865      1.281076    0.2025

R-squared             0.836230    Mean dependent var       0.032213
Adjusted R-squared    0.827274    S.D. dependent var       0.049219
S.E. of regression    0.020456    Akaike info criterion   -4.884093
Sum squared resid     0.053559    Schwarz criterion       -4.712760
Log likelihood        340.1183    Durbin-Watson stat       2.172684

Panel 2: Dependent Variable: D4Y
Method: Least Squares
Sample: 1961:1 1994:4; Included observations: 136
D4Y = C(1)+C(2)*D4Y(-1)+C(3)*D4Y(-2) + DUMPLUS*(C(4)+C(5)*
D4Y(-1)+C(6)*D4Y(-2))

Parameter    Coefficient   Std. Error   t-Statistic   Prob.
C(1)          -0.006937     0.005423     -1.279223    0.2031
C(2)           1.172922     0.138174      8.488734    0.0000
C(3)          -0.714031     0.112871     -6.326091    0.0000
C(4)           0.014694     0.006739      2.180352    0.0310
C(5)           0.092095     0.170993      0.538589    0.5911
C(6)           0.247076     0.144471      1.710210    0.0896

R-squared             0.834945    Mean dependent var       0.032213
Adjusted R-squared    0.828597    S.D. dependent var       0.049219
S.E. of regression    0.020377    Akaike info criterion   -4.905686
Sum squared resid     0.053980    Schwarz criterion       -4.777186
Log likelihood        339.5866    Durbin-Watson stat       2.158046

(c) [Histogram of the TAR(2) residuals]
(d) [Time plot of the TAR(2) residuals RESTAR2 together with the dummy DUMPLUS]

Exhibit 7.22 Industrial Production (Example 7.18)

STAR(2) model (Panel 1) and TAR(2) model (Panel 2) for quarterly series of yearly growth rates
in US industrial production; (c) shows the histogram of the TAR(2) residuals; (d) shows the time
plot of these residuals together with the expansion dummy $D^{+}$ (denoted by DUMPLUS).
expansions ($\Delta_4 y_{t-1} > 0$) and value 0 during recessions ($\Delta_4 y_{t-1} < 0$). The
results of the TAR(2) model are shown in Panel 2 of Exhibit 7.22. The
TAR(2) model is preferred to the STAR(2) model on the basis of AIC
and SIC. We can also compare these two models by the LR-test, as the TAR
corresponds to the two parameter restrictions $\gamma_1 = \infty$ and $\gamma_2 = 0$ in the
STAR model (in this case $F(z) = 1$ for $z > 0$ and $F(z) = 0$ for $z < 0$, so that
$F(\Delta_4 y_{t-1}) = D^+$). The test has value LR = 2(340.12 - 339.59) = 1.06, and
the P-value corresponding to the $\chi^2(2)$ distribution is P = 0.59. So the
TAR(2) model is not rejected. The TAR(2) model can also be tested against
the alternative of the AR(2) model with constant coefficients. The AR(2)
model is obtained from the TAR(2) model (7.26) by imposing the three
parameter restrictions $\alpha_2 = \phi_{12} = \phi_{22} = 0$. The AR(2) model was estimated
in Example 7.8 (see Exhibit 7.9). If we compare the log-likelihood values of
both models, we obtain LR = 2(339.6 - 334.2) = 10.8, with P-value
corresponding to the $\chi^2(3)$ distribution equal to P = 0.013. So the LR-test rejects
the AR(2) model, and we conclude that the adjustment process in recessions
differs from that in expansions.
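The two LR-tests above can be reproduced from the reported log-likelihood values. The following is a minimal sketch in Python (standard library only); the function names `chi2_sf` and `lr_test` are our own, and the log-likelihoods are taken from Exhibit 7.22 and from Panel 1 of Exhibit 7.24.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) of a chi-squared variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    # Closed forms for df = 1 and df = 2, then step up two degrees of freedom at a time.
    k = 1 if df % 2 else 2
    q = math.erfc(math.sqrt(h)) if k == 1 else math.exp(-h)
    while k < df:
        q += h ** (k / 2.0) * math.exp(-h) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return q

def lr_test(loglik_unrestricted, loglik_restricted, n_restrictions):
    """Return the LR statistic and its asymptotic chi-squared P-value."""
    lr = 2.0 * (loglik_unrestricted - loglik_restricted)
    return lr, chi2_sf(lr, n_restrictions)

# STAR(2) against TAR(2): two restrictions.
lr_star_tar, p_star_tar = lr_test(340.1183, 339.5866, 2)
# TAR(2) against AR(2): three restrictions.
lr_tar_ar, p_tar_ar = lr_test(339.5866, 334.2160, 3)

print(round(lr_star_tar, 2), round(p_star_tar, 2))
print(round(lr_tar_ar, 1), round(p_tar_ar, 3))
```

Small differences with the values in the text can arise because the text rounds the log-likelihoods before taking the difference.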

(iii)  Interpretation  of  the  threshold  model 

The estimated TAR(2) model in Panel 2 of Exhibit 7.22 can be used
to estimate the mean growth rates of industrial production during
expansion and recession periods. The mean growth in recession periods
(where $D^+ = 0$) is equal to -0.007/(1 - 1.17 + 0.71) = -1.3 per cent,
whereas in expansion periods (where $D^+ = 1$) it is 0.008/(1 - 1.26 + 0.47)
= +3.8 per cent. Exhibit 7.22 (d) shows the plot of the expansion dummy $D^+$.
This shows that expansions last for longer periods as compared to recessions.
The recessions in the periods 1961, 1974-5, and 1980-2 contain the
observations that were earlier detected as outliers in the AR(2) model in Example
7.17. Exhibit 7.22 (c) and (d) show the residuals of the TAR(2) model. These
residuals still contain some outliers in recession periods (where $D^+ = 0$).
Normality of the TAR(2) residuals is rejected because the kurtosis is still
large (with a value of 5.503, as compared to 5.896 in the AR(2) model in
Panel 6 of Exhibit 7.11).
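The regime means quoted above follow from the usual formula for the mean of a stationary AR(2) process, constant/(1 - $\phi_1$ - $\phi_2$), applied per regime. A small sketch with the coefficients C(1) to C(6) of Panel 2 (the helper name `ar2_mean` is ours):

```python
def ar2_mean(const, phi1, phi2):
    """Mean of a stationary AR(2) process y_t = const + phi1*y_{t-1} + phi2*y_{t-2} + e_t."""
    return const / (1.0 - phi1 - phi2)

# Estimates C(1)..C(6) from Panel 2 of Exhibit 7.22.
c1, c2, c3 = -0.006937, 1.172922, -0.714031
c4, c5, c6 = 0.014694, 0.092095, 0.247076

# Recession regime (DUMPLUS = 0): only C(1), C(2), C(3) are active.
mean_recession = ar2_mean(c1, c2, c3)
# Expansion regime (DUMPLUS = 1): the DUMPLUS terms are added to the coefficients.
mean_expansion = ar2_mean(c1 + c4, c2 + c5, c3 + c6)

print(round(100 * mean_recession, 1), round(100 * mean_expansion, 1))
```

These are the means that would apply if the process stayed in one regime; they are not averages over the observed recession and expansion quarters.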

Although  the  TAR  model  is  not  completely  satisfactory  from  a  statistical 
point  of  view,  it  is  of  economic  interest  as  it  distinguishes  between  recessions 
and  expansions. 


7.4.3  GARCH  models  for  clustered  volatility 

Changing  variance  in  time  series 

In the foregoing sections we considered lagged effects on the level $E[y_t | Y_{t-1}]$
of the observed time series. It has been assumed until now that the
innovations $\varepsilon_t$ all have the same variance $\sigma^2$. This is not always realistic, as there
may exist lagged effects in the conditional variance $\sigma_t^2 = \mathrm{var}(y_t | Y_{t-1})$ of the
time  series.  As  the  variance  is  a  measure  of  the  uncertainty  or  risk  on  future 
values  of  the  variable,  this  is  of  importance  for  decisions  in  business  and 
economics  that  involve  risk.  For  instance,  in  finance  the  price  of  options  and 
other  financial  instruments  depends  on  the  variance  or  volatility  of  price 
movements  in  the  market.  Further,  if  the  variance  changes  over  time,  then  an 
appropriate  model  for  the  variance  also  leads  to  more  accurate  forecast 
intervals,  with  wider  intervals  in  risky  periods  and  narrower  intervals  in 
stable  periods. 

Empirical  evidence  for  changing  variance 

Many  time  series  in  business  and  economics  exhibit  changes  in  volatility  over 
time. This especially holds true for many financial time series. As an
illustration, in Example 7.2 we considered the daily Dow-Jones index. The results in
Example 7.15 indicate that the series of daily returns $\Delta \log(DJ)$ is
uncorrelated (see Exhibit 7.18). That is, the past returns contain no information on
the returns of tomorrow. Exhibit 7.2 (c) in Example 7.2 shows the time plot
of  the  returns.  This  shows  that  the  variance  in  the  returns  changes  over  time. 
There  exist  quiet  periods  with  relatively  small  returns,  but  also  very  volatile 
periods  where  large  positive  and  negative  returns  follow  each  other.  This 
property  is  called  clustered  volatility.  In  this  case  the  variance  or  risk  in  the 
returns  can  be  predicted  to  some  extent.  Exhibit  7.2  further  indicates  that 
there  are  relatively  many  large  positive  and  negative  returns  that  cannot  be 
modelled  well  by  the  normal  distribution. 

Many  financial  time  series  have  the  above  properties  —  that  is,  no  auto¬ 
correlation  in  the  level  (white  noise),  time-varying  variance  (clustered  vola¬ 
tility),  and  distributions  with  excess  kurtosis  (fat  tails).  In  this  section  we 
describe  time  series  models  that  account  for  these  properties. 

Autoregressive  conditional  heteroskedasticity  (ARCH) 

If the variance of a time series depends on the past, we say that the series is
conditionally heteroskedastic. If this dependence on the past can be
expressed by an autoregression, this gives the so-called model with
autoregressive conditional heteroskedasticity, abbreviated as ARCH. For instance, the
ARCH(1) model for a white noise series $y_t$ is given by

$$y_t = \mu + \varepsilon_t, \qquad \varepsilon_t | Y_{t-1} \sim N(0, \sigma_t^2), \qquad \sigma_t^2 = \alpha_0 + \alpha_1 \varepsilon_{t-1}^2. \qquad (7.28)$$

Here $\sigma_t^2 = \mathrm{var}(y_t | Y_{t-1})$ is the conditional variance of the series, where
$Y_{t-1} = \{y_{t-1}, y_{t-2}, \ldots\}$ denotes the available information set at time $t - 1$.
As variances are non-negative, we impose the conditions that $\alpha_0 > 0$ and
$\alpha_1 \geq 0$. If $\alpha_1 > 0$, then the conditional variances are positively related, as $\sigma_t^2$ is
larger for larger values of the previous squared innovation $\varepsilon_{t-1}^2$. This makes it possible
to predict the variance of the time series, that is, the risk that is involved in
future movements in the series.

Properties  of  ARCH  processes 

It is left as an exercise (see Exercises 7.6 and 7.7 (a)) to prove that
the ARCH(1) process (7.28) has the following properties if $\alpha_1 < 1$. It is
a white noise process with $E[(y_t - \mu)(y_s - \mu)] = 0$ for all $t \neq s$. The mean
is $E[y_t] = \mu$. For $0 < \alpha_1 < 1$ the (unconditional) variance $E[(y_t - \mu)^2] =
\alpha_0/(1 - \alpha_1)$ is constant over time, and for $\alpha_1 = 1$ the process does not have
a finite (unconditional) variance anymore. Further, the series of squared
innovations $\varepsilon_t^2$ follows an AR(1) process, that is,

$$\varepsilon_t^2 = \alpha_0 + \alpha_1 \varepsilon_{t-1}^2 + \nu_t,$$

where $\nu_t = \varepsilon_t^2 - \sigma_t^2$ is a white noise process. This implies that the volatilities
are clustered if $\alpha_1 > 0$. Finally, the (unconditional) distribution of $\varepsilon_t$ is not
normal and has kurtosis larger than 3. So the ARCH(1) process has the three
aforementioned properties of many financial time series, that is, it is a white
noise process with clustered volatility and with fat tails. Simulation evidence
of these properties will be given in Example 7.19 at the end of this section.
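As a sketch of the reasoning behind the AR(1) representation and the variance formula (the complete proofs remain the subject of Exercises 7.6 and 7.7 (a)), the steps below use only the definitions in (7.28) and the law of iterated expectations. The last step assumes that $E[\varepsilon_t^2]$ is finite and the same for all $t$, which is where the condition $\alpha_1 < 1$ is needed.

```latex
\begin{align*}
E[\nu_t \mid Y_{t-1}] &= E[\varepsilon_t^2 \mid Y_{t-1}] - \sigma_t^2 = 0, \\
\varepsilon_t^2 &= \sigma_t^2 + \nu_t = \alpha_0 + \alpha_1 \varepsilon_{t-1}^2 + \nu_t, \\
E[\varepsilon_t^2] &= \alpha_0 + \alpha_1 E[\varepsilon_{t-1}^2]
  \quad\Longrightarrow\quad
  E[\varepsilon_t^2] = \frac{\alpha_0}{1 - \alpha_1}.
\end{align*}
```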

In practice, extensions of the ARCH(1) model are needed because the
squared innovations $\varepsilon_t^2$ often show correlation patterns that cannot be
modelled well by an AR(1) model. The ARCH(p) model has p lags in (7.28), so that

$$\sigma_t^2 = E[\varepsilon_t^2 | Y_{t-1}] = \alpha_0 + \alpha_1 \varepsilon_{t-1}^2 + \cdots + \alpha_p \varepsilon_{t-p}^2. \qquad (7.29)$$

It is left as an exercise (see Exercise 7.6) to show that in an ARCH(p) model
the squared innovations $\varepsilon_t^2$ follow an AR(p) process.

Generalized  ARCH  models  (GARCH) 

Still more general correlation patterns are obtained by using ARMA models
for the series $\varepsilon_t^2$. This leads to the class of so-called generalized ARCH models
(abbreviated as GARCH). For instance, the GARCH(1,1) model is described by

$$\sigma_t^2 = \alpha_0 + \alpha_1 \varepsilon_{t-1}^2 + \alpha_2 \sigma_{t-1}^2,$$

with all three parameters non-negative. As before, let $\nu_t = \varepsilon_t^2 - \sigma_t^2$; then $\nu_t$ is
a white noise process (see Exercise 7.6). By substituting $\sigma_t^2 = \varepsilon_t^2 - \nu_t$ in the
above equation we get

$$\varepsilon_t^2 = \alpha_0 + (\alpha_1 + \alpha_2) \varepsilon_{t-1}^2 + \nu_t - \alpha_2 \nu_{t-1}.$$

So the process $\varepsilon_t^2$ follows an ARMA(1,1) process with AR polynomial
$\phi(z) = 1 - (\alpha_1 + \alpha_2) z$ and MA polynomial $\theta(z) = 1 - \alpha_2 z$. The process $\varepsilon_t^2$ is
stationary if $\alpha_1 + \alpha_2 < 1$, and it is integrated of order 1 if $\alpha_1 + \alpha_2 = 1$. In the
latter case the (unconditional) variance of the innovations $\varepsilon_t$ is not constant
over time if $\alpha_0 > 0$, because $E[\varepsilon_t^2] = \alpha_0 + (\alpha_1 + \alpha_2) E[\varepsilon_{t-1}^2] = \alpha_0 + E[\varepsilon_{t-1}^2]$ in
this case. The general GARCH(p, q) model contains p lags of $\sigma_t^2$ and q lags of
$\varepsilon_t^2$ in the above equation for the conditional variance $\sigma_t^2$. Then $\varepsilon_t^2$ follows an
ARMA(m, p) process where m = max(p, q).
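The difference between the stationary and the integrated case can be made concrete by iterating the recursion for the unconditional variance, $E[\varepsilon_t^2] = \alpha_0 + (\alpha_1 + \alpha_2) E[\varepsilon_{t-1}^2]$. The sketch below is only illustrative; the start value and the parameter values of the integrated case are our own choices.

```python
def unconditional_variance_path(alpha0, alpha1, alpha2, start, steps):
    """Iterate E[eps_t^2] = alpha0 + (alpha1 + alpha2) * E[eps_{t-1}^2] for a GARCH(1,1) process."""
    path = [start]
    for _ in range(steps):
        path.append(alpha0 + (alpha1 + alpha2) * path[-1])
    return path

# Stationary case (alpha1 + alpha2 = 0.9): converges to alpha0 / (1 - alpha1 - alpha2) = 10.
stationary = unconditional_variance_path(1.0, 0.2, 0.7, start=1.0, steps=200)

# Integrated case (alpha1 + alpha2 = 1): the variance grows by alpha0 in every period.
integrated = unconditional_variance_path(1.0, 0.5, 0.5, start=1.0, steps=200)

print(round(stationary[-1], 4), integrated[-1])
```

In the integrated case the variance increases linearly with the number of periods, so that it is not constant over time.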

Combined  models  for  level  and  variance 

The foregoing models are pure (G)ARCH processes, that is, the variance of
$y_t$ is predictable but the level of $y_t$ is not predictable (as $y_t$ is a white noise
process). We can also combine an ARMA model for the level of $y_t$ with
a GARCH model for the variance of the innovations $\varepsilon_t = y_t - E[y_t | Y_{t-1}]$.
For instance, an AR(1)-ARCH(1) model is described by the equation
$y_t = \alpha + \phi y_{t-1} + \varepsilon_t$ for the level, where $\varepsilon_t | Y_{t-1} \sim N(0, \sigma_t^2)$ with variance
equation $\sigma_t^2 = \alpha_0 + \alpha_1 \varepsilon_{t-1}^2$. This model has four parameters and can be described by

$$y_t | Y_{t-1} \sim N(\alpha + \phi y_{t-1}, \sigma_t^2) = N(\alpha + \phi y_{t-1}, \; \alpha_0 + \alpha_1 (y_{t-1} - \alpha - \phi y_{t-2})^2). \qquad (7.30)$$

In a similar way we can formulate mixed ARMA-GARCH models. In some
cases it is also of interest to consider clustered volatility models for the
error terms in regression models $y_t = x_t'\beta + \varepsilon_t$. For example, suppose that
$y_t = \alpha + \beta x_t + \varepsilon_t$ and that $\varepsilon_t | Y_{t-1} \sim N(0, \sigma_t^2)$, where $\sigma_t^2 = \alpha_0 + \alpha_1 \varepsilon_{t-1}^2$. Then
we get

$$y_t | Y_{t-1} \sim N(\alpha + \beta x_t, \sigma_t^2) = N(\alpha + \beta x_t, \; \alpha_0 + \alpha_1 (y_{t-1} - \alpha - \beta x_{t-1})^2).$$

The use of GARCH error terms in regression models will be illustrated in
Example 7.22.

Example  7.19:  Simulated  ARCH  and  GARCH  Time  Series 

We illustrate the above results by means of two simulations. The first time
series is generated by the ARCH(1) process (7.28) with parameter values
$\mu = 0$, $\alpha_0 = 1$, and $\alpha_1 = 0.9$. So $y_t = \varepsilon_t$ with $y_t | Y_{t-1} \sim N(0, \sigma_t^2)$ where
$\sigma_t^2 = 1 + 0.9 y_{t-1}^2$. This series is simulated as follows. First we draw
white noise terms $\eta_t \sim NID(0, 1)$, and then the ARCH(1) process $y_t$ is
generated by recursively computing $y_t = \sigma_t \eta_t$ where $\sigma_t^2 = 1 + 0.9 y_{t-1}^2$. The
results are in Exhibit 7.23 (a-f). The series $y_t$ shows clustered




[Panels (a)-(e): plots of the series INNOV ((a), (b)), ARCH1 ((c), (d)), and SDARCH1 ((e)); the horizontal axis of the time plots runs over observations 1 to 500.]


(f)

Lag    SACF y_t    SACF y_t^2    SPACF y_t^2
 1      0.112       0.587         0.587
 2      0.070       0.353         0.013
 3      0.056       0.229         0.026
 4     -0.068       0.164         0.025
 5     -0.072       0.181         0.105
 6     -0.042       0.100        -0.086
 7     -0.015       0.069         0.021
 8      0.030       0.071         0.034
 9     -0.081       0.040        -0.029
10      0.055       0.049         0.024

Exhibit  7.23  Simulated  ARCH  and  GARCH  Time  Series  (Example  7.19) 


Time series simulated by an ARCH(1) process: innovations $\eta_t$ (denoted by INNOV ((a)-(b))),
ARCH(1) series $y_t = \sigma_t \eta_t$ (denoted by ARCH1 ((c)-(d))), time plot of conditional standard
deviation $\sigma_t$ (where $\sigma_t^2 = 1 + 0.9 y_{t-1}^2$ (e)), and S(P)ACF of the ARCH(1) series $y_t$ and of the
squared series $y_t^2$ (f).
Exhibit 7.23 (Contd.)

Time series simulated by a GARCH(1,1) process: innovations $\eta_t$ (denoted by INNOV
((g)-(h))), GARCH(1,1) series $y_t = \sigma_t \eta_t$ (denoted by GARCH1 ((i)-(j))), time plot of
conditional standard deviation $\sigma_t$ (where $\sigma_t^2 = 1 + 0.2 y_{t-1}^2 + 0.7 \sigma_{t-1}^2$ (k)), and S(P)ACF of the
GARCH(1,1) series $y_t$ and of the squared series $y_t^2$ (l).
volatility (see (c)) and has a kurtosis of 5.38 (see (d)). The exhibit also shows
the series of simulated conditional standard deviations $\sigma_t$ (in (e)), as well as
the correlograms of the series $y_t$ and of the squared series $y_t^2$ (in (f)). The
SPACF of the series $y_t^2$ indeed indicates that this is an AR(1) process.

In the second simulation we generate a time series from the GARCH(1,1)
model with parameter values $\mu = 0$, $\alpha_0 = 1$, $\alpha_1 = 0.2$, and $\alpha_2 = 0.7$. The
results are in Exhibit 7.23 (g-l). Again, the simulated series is white noise,
it has clustered volatility, and it has excess kurtosis.
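The simulation recipe of this example can be sketched in a few lines of Python (standard library only). The function names, the seed, and the start values $y_0 = 0$ and $\sigma_0^2 = 0$ are our own assumptions, as the start values are not specified in the example, so the resulting series will differ from those in Exhibit 7.23.

```python
import math
import random

def simulate_garch11(n, alpha0, alpha1, alpha2=0.0, seed=12345):
    """Simulate y_t = sigma_t * eta_t with sigma_t^2 = alpha0 + alpha1*y_{t-1}^2 + alpha2*sigma_{t-1}^2.

    With alpha2 = 0 this is the ARCH(1) process (7.28) with mu = 0.
    """
    rng = random.Random(seed)
    y, sigma2, eta = [], [], []
    y_prev, s2_prev = 0.0, 0.0  # assumed start values (not given in the text)
    for _ in range(n):
        s2 = alpha0 + alpha1 * y_prev ** 2 + alpha2 * s2_prev
        e = rng.gauss(0.0, 1.0)  # innovation eta_t ~ NID(0, 1)
        y_t = math.sqrt(s2) * e
        y.append(y_t)
        sigma2.append(s2)
        eta.append(e)
        y_prev, s2_prev = y_t, s2
    return y, sigma2, eta

def sample_kurtosis(x):
    """Sample kurtosis m4 / m2^2 (equal to 3 for the normal distribution)."""
    n = len(x)
    m = sum(x) / n
    m2 = sum((v - m) ** 2 for v in x) / n
    m4 = sum((v - m) ** 4 for v in x) / n
    return m4 / m2 ** 2

arch_y, arch_s2, arch_eta = simulate_garch11(500, 1.0, 0.9)
garch_y, garch_s2, garch_eta = simulate_garch11(500, 1.0, 0.2, 0.7)

print(sample_kurtosis(arch_eta), sample_kurtosis(arch_y))
print(sample_kurtosis(garch_eta), sample_kurtosis(garch_y))
```

Comparing the kurtosis of the innovations with that of the simulated series gives an impression of the fat tails that are generated by the changing variance.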

Exercises: T: 7.6, 7.7a.


7.4.4  Estimation  and  diagnostic  tests  of  GARCH  models 

Two-step  estimation  of  ARMA-GARCH  models 

If the process $y_t$ follows a GARCH process with mean $\mu$, then the squared
series $(y_t - \mu)^2$ follows an ARMA process. So the parameters of a GARCH
model can be estimated by estimating an ARMA model for the series
$(y_t - \mu)^2$ by the methods discussed in Section 7.2.2. If the mean $\mu$ is
unknown, it can be replaced by the sample mean $\bar{y}$. More generally, if the
process $y_t$ follows an ARMA process with innovations that are GARCH,
then the model parameters can be estimated in two steps. In the first step the
parameters of the ARMA model for $y_t$ are estimated as discussed in Section
7.2.2. Let the residuals of the ARMA model be denoted by $e_t$; then in the
second step the parameters of the GARCH model are estimated by estimating
an ARMA model for the series $e_t^2$ of squared residuals. For example, for the
AR(1)-ARCH(1) model the first step consists of a regression of $y_t$ on a
constant and $y_{t-1}$, and the second step of a regression of the squared residual
$e_t^2$ on a constant and $e_{t-1}^2$. This two-step method provides consistent
estimators, but they are not efficient. This is because the error terms are not
normally distributed. For instance, in the AR(1)-ARCH(1) model, the
error terms $\varepsilon_t$ in the AR(1) model for $y_t$ are not normally distributed, so
that the AR parameters are not estimated efficiently in the first step. Also the
error terms $\nu_t = \varepsilon_t^2 - \sigma_t^2$ in the AR model for $\varepsilon_t^2$ are not normally distributed,
so that the ARCH parameters in the second step are also not efficiently
estimated.
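For the AR(1)-ARCH(1) model both steps are simple regressions with a constant and one regressor, so that the two-step method can be sketched without any matrix algebra. The function names below and the short data series in the usage line are hypothetical and serve only to show the mechanics.

```python
def simple_ols(x, y):
    """OLS of y on a constant and x; returns (intercept, slope, residuals)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    return intercept, slope, resid

def two_step_ar1_arch1(y):
    """Two-step estimates (alpha, phi, alpha0, alpha1) of the AR(1)-ARCH(1) model."""
    # Step 1: regress y_t on a constant and y_{t-1}.
    alpha, phi, e = simple_ols(y[:-1], y[1:])
    # Step 2: regress the squared residual e_t^2 on a constant and e_{t-1}^2.
    e2 = [v * v for v in e]
    alpha0, alpha1, _ = simple_ols(e2[:-1], e2[1:])
    return alpha, phi, alpha0, alpha1

# Hypothetical data, for illustration only.
estimates = two_step_ar1_arch1([1.0, 2.0, 0.0, 3.0, 1.0, 4.0, 0.0, 2.0])
print(estimates)
```

Note that the second step does not impose the restrictions $\alpha_0 > 0$ and $\alpha_1 \geq 0$, so that in small samples the estimates may violate them.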


Estimation  by  maximum  likelihood 

Consistent  and  efficient  estimates  are  obtained  by  applying  maximum  likelihood 
in  correctly  specified  models.  As  an  illustration  we  derive  the  log-likelihood  for  the 
AR(1)-ARCH(1)  model;  the  likelihood  function  for  other  models  can  be  obtained 
in  a  similar  way.  The  likelihood  derived  in  Section  7.2.2  for  ARMA  models  does 
not apply in this case, because the innovations $\varepsilon_t$ are no longer normally
distributed. To express the log-likelihood we use the fact that a joint density function can
be factorized as $f(z_1, z_2) = f(z_1) f(z_2 | z_1)$, where $f(z_2 | z_1)$ is the conditional density
of $z_2$ conditional on $z_1$. So the likelihood function can be factorized as

$$L = p(y_1, y_2, \ldots, y_n) = p(y_1) \, p(y_2 | y_1) \, p(y_3 | y_1, y_2) \cdots p(y_n | Y_{n-1}).$$

These conditional densities are normal (see (7.30)), so that

$$\begin{aligned}
\log(L(\alpha, \phi, \alpha_0, \alpha_1)) &= \sum_{t=1}^{n} \log p(y_t | Y_{t-1}) \\
&= -\frac{n}{2} \log(2\pi) - \frac{1}{2} \sum_{t=1}^{n} \log(\sigma_t^2) - \frac{1}{2} \sum_{t=1}^{n} \frac{\varepsilon_t^2}{\sigma_t^2} \\
&= -\frac{n}{2} \log(2\pi) - \frac{1}{2} \sum_{t=1}^{n} \log\left(\alpha_0 + \alpha_1 (y_{t-1} - \alpha - \phi y_{t-2})^2\right) \\
&\quad - \frac{1}{2} \sum_{t=1}^{n} \frac{(y_t - \alpha - \phi y_{t-1})^2}{\alpha_0 + \alpha_1 (y_{t-1} - \alpha - \phi y_{t-2})^2}.
\end{aligned} \qquad (7.31)$$

Because the values of ($y_t$, $t \leq 0$) are unobserved, one often maximizes the
conditional log-likelihood (where the observations $y_1$ and $y_2$ are treated as fixed).
In this case the summations in (7.31) start at t = 3 instead of t = 1. The ML
estimators have the usual asymptotic properties if the ARMA process $y_t$ and
the GARCH model are both stationary. For instance, the ML estimators of the
AR(1)-ARCH(1) model are asymptotically normally distributed, provided
that $-1 < \phi < 1$ and $0 < \alpha_1 < 1$.
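The conditional log-likelihood of the AR(1)-ARCH(1) model is easily evaluated for given parameter values. The sketch below follows (7.31) with the summations starting at t = 3; the function name is ours, and the maximization itself (for instance by a numerical optimizer) is not shown.

```python
import math

def ar1_arch1_cond_loglik(y, alpha, phi, alpha0, alpha1):
    """Conditional log-likelihood (7.31) with y_1 and y_2 treated as fixed.

    y is a list with y[0] = y_1, so index t = 2 corresponds to t = 3 in the text.
    """
    loglik = 0.0
    for t in range(2, len(y)):
        e_prev = y[t - 1] - alpha - phi * y[t - 2]  # innovation of the previous period
        e_t = y[t] - alpha - phi * y[t - 1]         # innovation of the current period
        sigma2 = alpha0 + alpha1 * e_prev ** 2      # conditional variance
        loglik += (-0.5 * math.log(2.0 * math.pi)
                   - 0.5 * math.log(sigma2)
                   - 0.5 * e_t ** 2 / sigma2)
    return loglik

# Small check by hand: sigma2 = 1 + 1*1^2 = 2 and e_t = 2 for the single term t = 3.
value = ar1_arch1_cond_loglik([0.0, 1.0, 2.0], alpha=0.0, phi=0.0, alpha0=1.0, alpha1=1.0)
print(value)
```

A numerical optimizer should restrict the search to $\alpha_0 > 0$ and $\alpha_1 \geq 0$, as otherwise the logarithm of the conditional variance may not be defined.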


Test  for  the  presence  of  conditional  heteroskedasticity 

Correlations  in  the  variance  of  a  series  can  be  exploited  to  forecast  future 
risks  and  to  adjust  the  width  of  forecast  intervals.  Therefore  it  is  of  interest  to 
test  for  the  presence  of  GARCH  effects.  This  can  be  done,  for  instance,  by 
estimating  an  ARMA-GARCH  model  and  applying  an  LR- test  or  an  F- test 
on  the  joint  significance  of  the  GARCH  parameters.  These  tests  have  the 
usual  asymptotic  distributions  if  the  ARMA  model  and  the  GARCH  model 
are  both  stationary. 

The  Lagrange  Multiplier  test  is  somewhat  simpler  to  perform.  In  this  case 
we  need  to  estimate  an  ARMA  model  only  under  the  null  hypothesis  that  no 
GARCH  effects  are  present,  so  that  we  can  apply  the  estimation  methods  of 
Section  7.2.2.  As  an  example,  suppose  that  we  wish  to  test  the  following 
hypothesis on the disturbance terms $\varepsilon_t$ of an ARMA model. The null
hypothesis states that the terms $\varepsilon_t$ are independent, so that $\varepsilon_t | Y_{t-1} \sim
N(0, \sigma^2)$. The alternative hypothesis is that the terms $\varepsilon_t$ are conditionally
heteroskedastic according to the ARCH(p) process (7.29), so that
$\mathrm{var}(\varepsilon_t | Y_{t-1}) = \sigma_t^2 = \alpha_0 + \sum_{k=1}^{p} \alpha_k \varepsilon_{t-k}^2$. Then the null hypothesis of
(conditional) homoskedasticity corresponds to the p parameter restrictions

$$H_0: \alpha_1 = \cdots = \alpha_p = 0.$$

It is left as an exercise (see Exercise 7.7) to derive that, in an ARMA model
for $y_t$, the LM-test for ARCH(p) disturbances can be computed by the
following steps.


LM-test for ARCH(p) error terms

•  Step 1: Estimate the ARMA model. First estimate the ARMA model by
ML, that is, by OLS (for an AR model) or by NLS (if MA terms are
present). Let $e_t$ be the corresponding series of (OLS or NLS) residuals.

•  Step 2: Estimate the ARCH model. Regress the squared residuals $e_t^2$ on a
constant and $e_{t-1}^2, \ldots, e_{t-p}^2$.

•  Step 3: LM = $nR^2$. Then the LM-test can be computed as LM = $nR^2$ of
the regression in step 2. Under the null hypothesis that no ARCH is present,
the LM-test asymptotically has the $\chi^2(p)$ distribution, and the null
hypothesis is rejected for large values of the LM-test.
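The steps above can be sketched for the special case of one lag (p = 1), where the test regression in step 2 has a constant and a single regressor, so that $R^2$ equals the squared correlation between $e_t^2$ and $e_{t-1}^2$. The function names are ours, the residuals of step 1 are taken as given, and the sketch assumes that the squared residuals are not constant.

```python
import math

def chi2_1_pvalue(x):
    """P-value P(X > x) for the chi-squared distribution with one degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0)) if x > 0 else 1.0

def arch_lm_test_one_lag(resid):
    """LM = n * R^2 of the regression of e_t^2 on a constant and e_{t-1}^2 (ARCH(1) alternative)."""
    e2 = [e * e for e in resid]
    x, y = e2[:-1], e2[1:]  # one observation is lost because of the lag
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    r2 = sxy ** 2 / (sxx * syy)  # R^2 of a simple regression with constant
    lm = n * r2
    return lm, chi2_1_pvalue(lm)

# Hypothetical residuals, for illustration only.
lm, pval = arch_lm_test_one_lag([1.0, -1.0, 2.0, -2.0, 1.0])
print(lm, pval)
```

For p > 1 the regression in step 2 contains several lags and requires a multiple regression routine, and the P-value is then based on the $\chi^2(p)$ distribution.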


ARCH  LM- test  as  a  general  test  for  non-linearities 

A significant value for LM = $nR^2$ in the ARCH LM-test does not necessarily
imply that GARCH is the correct alternative model. A significant $R^2$ for the
squared  residuals  may  also  be  caused,  for  example,  by  unmodelled  non- 
linearities  in  the  functional  relation  (similar  to  the  RESET)  or  by  clusters  of 
outliers.  Therefore,  the  ARCH  LM- test  can  also  be  used  as  a  general  test  for 
possible  non-linearities  in  the  time  series.  If  the  interpretation  of  clustered 
volatility  is  attractive  —  for  instance,  in  financial  applications  —  then  one  can 
estimate  a  GARCH  model.  In  other  situations  one  may  find  other  non-linear 
models  more  useful  —  for  instance,  the  models  discussed  in  Sections  7.4.1 
and  7.4.2. 

Standardized  residuals  as  diagnostic  tool 

The purpose of a GARCH model is to represent changes in the variance. To
check whether the volatility clustering is modelled correctly, let $e_t$ denote the
series of residuals of the estimated ARMA-GARCH model. The so-called
standardized residuals are defined by $e_t / \hat{\sigma}_t$, where $\hat{\sigma}_t^2$ is the estimated
conditional variance. If the model is correct, the standardized residuals should be
approximately uncorrelated and normally distributed with constant
(conditional) variance. This can be checked by applying the Jarque-Bera test for
normality and the ARCH test for absence of heteroskedasticity on the series
of standardized residuals.
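As a small illustration, the sketch below standardizes a series of residuals and computes the Jarque-Bera statistic from the skewness and kurtosis, using the usual formula JB = $n(S^2/6 + (K - 3)^2/24)$ with the $\chi^2(2)$ distribution under normality. The function names are ours; the check uses the skewness and kurtosis reported for the standardized GARCH(2,2) residuals in Exhibit 7.25 (h).

```python
import math

def standardize(resid, sigma2_hat):
    """Standardized residuals e_t / sigma_t, with sigma2_hat the estimated conditional variances."""
    return [e / math.sqrt(s2) for e, s2 in zip(resid, sigma2_hat)]

def jarque_bera(n, skewness, kurtosis):
    """Jarque-Bera statistic and its chi-squared(2) P-value."""
    jb = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    return jb, math.exp(-jb / 2.0)  # upper tail of the chi-squared(2) distribution

# Values reported in Exhibit 7.25 (h) for the standardized GARCH(2,2) residuals.
jb, jb_pvalue = jarque_bera(2527, -0.451443, 5.520762)
print(round(jb, 2), jb_pvalue)
```

The computed statistic agrees with the value 754.88 reported in the exhibit, so that normality of these standardized residuals is rejected.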


Use  of  GARCH  models  in  risk  modelling 

In a GARCH model the conditional variance changes over time, so that the
forecast accuracy will also vary over time. The forecast interval can be based
on the point forecast plus or minus $2\sigma_t$. In very volatile periods, where the
residuals $e_t$ are large, the estimated conditional variances $\sigma_t^2$ will also be
relatively large, so that the forecast intervals are wider. This reflects the fact
that in such periods there is more uncertainty about the future values of $y_t$.
That is, large values of $\sigma_t$ correspond to periods with higher risk. In Example
7.21 below we consider the use of GARCH models in predicting future risks.
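The effect of the last residual on the width of the forecast interval can be sketched for an ARCH(1) variance equation. The function name is ours; the ARCH parameters are the estimates of Panel 3 of Exhibit 7.24, whereas the point forecast of 0.03 and the two residual values are hypothetical.

```python
import math

def arch1_forecast_interval(point_forecast, last_residual, alpha0, alpha1, width=2.0):
    """One-step-ahead interval: point forecast plus or minus width * sigma_t, ARCH(1) variance."""
    sigma = math.sqrt(alpha0 + alpha1 * last_residual ** 2)
    return point_forecast - width * sigma, point_forecast + width * sigma

# ARCH(1) estimates from Panel 3 of Exhibit 7.24; other inputs are hypothetical.
a0, a1 = 0.000288, 0.303640
quiet = arch1_forecast_interval(0.03, last_residual=0.0, alpha0=a0, alpha1=a1)
volatile = arch1_forecast_interval(0.03, last_residual=0.05, alpha0=a0, alpha1=a1)

print(quiet)
print(volatile)
```

The interval after a large residual is wider than the interval after a zero residual, in line with the higher risk in volatile periods.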


Example  7.20:  Industrial  Production  (continued) 

We  continue  our  analysis  of  the  quarterly  series  of  yearly  growth  rates  in  US 
industrial  production  (see  also  Examples  7.17  and  7.18).  We  will  discuss 
(i)  results  of  an  AR-ARCH  model  and  (ii)  some  diagnostic  tests  of  this 
model. 

(i)  AR(2)-ARCH(1)  model 

In  Example  7.17  we  observed  that  some  of  the  outliers  in  the  yearly  growth 
rates  of  US  industrial  production  appear  in  clusters.  Therefore  we  apply  a  test 
for  the  possible  presence  of  volatility  clustering.  As  the  growth  rates  show 
correlations,  we  do  not  estimate  a  pure  GARCH  model  for  this  series.  We 
model  the  growth  rates  by  an  AR(2)  model.  This  model  was  estimated  in 
Example 7.8 (see Exhibit 7.9), but for convenience the estimated AR(2) model
is shown once more in Panel 1 of Exhibit 7.24. The ARCH LM-test for
ARCH(1) effects in the disturbances of the AR(2) process is obtained as
LM = $nR^2$ of the regression of the squared residuals $e_t^2$ of this model on a
constant and the lagged squared residuals $e_{t-1}^2$. This is shown in Panel 2 of
Exhibit 7.24. The LM-test has value LM = $nR^2 = 135 \cdot 0.063 = 8.5$ with
P-value P = 0.003 (one observation is lost of the 136 available residuals,
because of the term $e_{t-1}^2$ in the test equation). We conclude that the residuals
contain significant ARCH effects.

Panel 3 of Exhibit 7.24 shows the ML estimates of the AR(2)-ARCH(1)
model for $\Delta_4 y_t$. The AR parameters do not change much as compared to the
AR(2) model without ARCH, but the ARCH parameters are of interest in
predicting uncertainties in US industrial production. The ARCH parameter
$\alpha_1 = 0.304$ is relatively small but significant (P = 0.016). Also the LR-test on
ARCH has a significant value, as the results in Panels 1 and 3 of Exhibit 7.24
Panel 1: Dependent Variable: D4Y
Method: Least Squares
Sample: 1961:1 1994:4; Included observations: 136

Variable    Coefficient   Std. Error   t-Statistic   Prob.
C            0.007147     0.002161      3.307447     0.0012
D4Y(-1)      1.332025     0.072094     18.47633      0.0000
D4Y(-2)     -0.545933     0.072174     -7.564120     0.0000

R-squared            0.821380    Log likelihood    334.2160
S.E. of regression   0.020958

Panel 2: ARCH Test:
F-statistic      8.973384    Probability    0.003268
Obs*R-squared    8.532634    Probability    0.003488

Test Equation: Dependent Variable: RESID^2
Method: Least Squares
Sample(adjusted): 1961:2 1994:4; Included observations: 135

Variable       Coefficient   Std. Error   t-Statistic   Prob.
C              0.000298      8.16E-05     3.651675      0.0004
RESID^2(-1)    0.233622      0.077989     2.995561      0.0033

R-squared      0.063205

Panel 3: Dependent Variable: D4Y
Method: ML - ARCH
Sample: 1961:1 1994:4; Included observations: 136
Convergence achieved after 15 iterations

Variable    Coefficient   Std. Error   z-Statistic   Prob.
C            0.007454     0.002138      3.485798     0.0005
D4Y(-1)      1.406509     0.071004     19.80898      0.0000
D4Y(-2)     -0.600541     0.076058     -7.895869     0.0000

Variance Equation
C            0.000288     3.24E-05      8.895637     0.0000
ARCH(1)      0.303640     0.126185      2.406315     0.0161

R-squared            0.819478    Log likelihood    342.6833
S.E. of regression   0.021229

(d) 


(e) 


Panel 5: ARCH Test (one lag)

F-statistic      0.258530    Prob    0.611975
Obs*R-squared    0.261908    Prob    0.608812


Exhibit  7.24  Industrial  Production  (Example  7.20) 


AR(2)  model  for  yearly  growth  rates  in  US  industrial  production  (Panel  1),  ARCH  test  on 
residuals  of  this  model  (denoted  by  RESID,  Panel  2),  AR(2)-ARCH(1)  model  (Panel  3),  test  on 
normality  of  the  standardized  residuals  (d),  and  test  on  remaining  ARCH  in  the  standardized 
residuals  of  the  AR(2)-ARCH(1)  model  (Panel  5). 
give LR = 2(342.7 - 334.2) = 17.0 (with P-value according to the $\chi^2(1)$
distribution P = 0.000).

(ii)  Diagnostic  tests 

Exhibit  7.24  (d)  and  (e)  show  some  diagnostic  tests  on  the  standardized 
residuals  of  the  AR(2)-ARCH(1)  model.  The  ARCH  test  (with  one  lag) 
indicates  that  no  ARCH  effects  remain  in  the  AR(2)-ARCH(1)  model 
(P = 0.61). However, normality is rejected (P = 0.00) because the kurtosis
of  5.129  is  still  quite  large.  This  is  due  to  some  isolated  outliers  that  are  not 
captured  well  by  the  ARCH(l)  process. 

Example  7.21:  Dow-Jones  Index  (continued) 

As  GARCH  models  are  of  particular  interest  in  finance  we  now  consider  data 
on  the  Dow-Jones  index.  We  will  discuss  (i)  some  of  the  data  properties  of 
this  series,  (ii)  results  of  two  GARCH  models,  (iii)  prediction  of  the  risk  in 
tomorrow’s  returns,  (iv)  an  evaluation  of  the  quality  of  the  risk  predictions, 
and  (v)  the  prediction  of  high  risks. 

(i)  Data  properties 

We  consider  the  series  of  daily  returns  of  the  Dow-Jones  index  over  the  period 
1990-9.  Panel  1  of  Exhibit  7.25  shows  the  sample  (partial)  autocorrelations 
of  this  series  and  of  the  squares  of  this  series.  The  correlations  in  the  returns 
are  very  small,  so  that  the  returns  cannot  be  predicted  from  their  past. 
However,  the  squared  returns  series  contains  significant  correlations,  so  that 
the  risk  in  the  returns  is  predictable  to  some  extent.  Further,  the  histogram  of 
the  returns  in  (b)  shows  that  the  kurtosis  is  equal  to  8.2,  so  that  this  series  has 
fat  tails.  These  properties  (no  correlation  in  levels,  clustered  volatility,  and  fat 
tails)  motivate  the  use  of  GARCH  models  for  the  returns. 

(ii)  Results  of  two  GARCH  models 

Panels 3 and 6 of Exhibit 7.25 show the results of two GARCH models. The
models are of the form $y_t = \mu + \varepsilon_t$, where $\mu$ is the mean of the returns and $\varepsilon_t$
follows an ARCH(5) process (in Panel 3) or a GARCH(2,2) process (in Panel
6). The ARCH and GARCH parameters are significant (except the first
lagged GARCH term in the GARCH(2,2) model). Exhibit 7.25 (d, e, g, h)
show histograms and ARCH tests (with five lags included) of the
standardized residuals of these two models. The standardized residuals contain no
ARCH anymore (P = 0.47 in Panel 4 for the ARCH(5) model and P = 0.61
in Panel 7 for the GARCH(2,2) model). The kurtosis has decreased (to
around 5.4 for the ARCH(5) model (see (e)) and 5.5 for the GARCH(2,2)
model (see (h))), but the standardized residuals are not normally distributed.
This is because the series of returns contains some isolated outliers that
cannot be modelled well by a GARCH model. Note that the sample mean
(a)  (b)


Panel 1

Lag    SACF y_t    SPACF y_t    SACF y_t^2    SPACF y_t^2
 1      0.030       0.030        0.208         0.208
 2     -0.017      -0.018        0.151         0.112
 3     -0.046      -0.045        0.069         0.018
 4     -0.011      -0.008        0.091         0.062
 5     -0.013      -0.014        0.178         0.149
 6     -0.004      -0.006        0.066        -0.011
 7     -0.047      -0.048        0.137         0.093
 8     -0.012      -0.011        0.057         0.001
 9      0.041       0.039        0.078         0.027
10      0.039       0.032        0.055         0.002

Panel 3: Dependent Variable: DJRET
Method: ML - ARCH
Sample(adjusted): 2 2528; Included observations: 2527
Convergence achieved after 15 iterations

Variable    Coefficient   Std. Error   z-Statistic   Prob.
C           0.000668      0.000165      4.044671     0.0001

Variance Equation
C           3.97E-05      1.67E-06     23.86055      0.0000
ARCH(1)     0.085180      0.017965      4.741323     0.0000
ARCH(2)     0.127996      0.014382      8.899731     0.0000
ARCH(3)     0.064421      0.020266      3.178738     0.0015
ARCH(4)     0.124051      0.019582      6.334810     0.0000
ARCH(5)     0.104425      0.020429      5.111540     0.0000

S.E. of regression   0.008928    Akaike info criterion   -6.708544
Log likelihood       8483.245    Schwarz criterion       -6.692381

( d ) 


Panel  4:  ARCH  Test  on  STRESID  of  ARCH(5) 
5  lags  included  in  the  test  equation 
F-statistic  0.907729  Prob  0.474914 

Obs*R-squared  4.541276  Prob  0.474380 


(e) 


Exhibit  7.25  Dow-Jones  Index  (Example  7.21) 


S(P)ACF  of  the  daily  Dow-Jones  returns  and  of  the  squares  of  this  series  (Panel  1),  histogram  of 
this  series  (b),  ARCH(5)  model  (Panel  3),  ARCH  test  on  standardized  residuals  of  this  model 
(Panel  4),  and  histogram  ( e ). 


of  the  daily  returns  is  0.056  per  cent  (see  the  histogram  of  the  returns  series  in 
Exhibit  7.25  (b)),  but  the  mean  daily  return  is  estimated  as  0.067  per  cent  in 
the  ARCH(5)  model  in  Panel  3  and  as  0.062  per  cent  in  the  GARCH(2,2) 
model  in  Panel  6.  The  estimates  of  the  ARCH  and  GARCH  models  are  more 
Panel 6: Dependent Variable: DJRET
Method: ML - ARCH
Sample(adjusted): 2 2528; Included observations: 2527
Convergence achieved after 26 iterations

Variable    Coefficient   Std. Error   z-Statistic   Prob.
C           0.000621      0.000154     4.040140      0.0001

Variance Equation
C           1.86E-06      4.11E-07     4.515180      0.0000
ARCH(1)     0.037997      0.011418     3.327782      0.0009
ARCH(2)     0.070522      0.008987     7.847085      0.0000
GARCH(1)    0.117028      0.130196     0.898860      0.3687
GARCH(2)    0.752955      0.121789     6.182480      0.0000

S.E. of regression   0.008926    Akaike info criterion   -6.755778
Log likelihood       8541.926    Schwarz criterion       -6.741925

(g)

Panel 7: ARCH Test on STRESID of GARCH(2,2)
5 lags included in the test equation
F-statistic      0.722747    Prob  0.606318
Obs*R-squared    3.617159    Prob  0.605739

Exhibit 7.25 (Contd.)

(h)

Series: Standardized Residuals of GARCH(2,2) model
Sample 2 2528
Observations 2527

Mean           -0.015458
Median         -0.002054
Maximum         4.973832
Minimum        -6.683131
Std. Dev.       1.000819
Skewness       -0.451443
Kurtosis        5.520762

Jarque-Bera     754.8827
Probability     0.000000

GARCH(2,2)  model  for  the  daily  Dow-Jones  returns  (Panel  6),  ARCH  test  on  standardized 
residuals  of  this  model  (Panel  7),  and  histogram  (h). 


reliable,  because  the  returns  are  not  normally  distributed  so  that  the  sample 
mean  is  not  an  efficient  estimator. 

(iii) Prediction of tomorrow’s risk

Next we consider the use of these two models in predicting whether the next
day is risky or not. The models produce estimates of tomorrow’s risk
$\sigma_t^2 = \mathrm{var}(y_t \mid Y_{t-1})$, which can be compared with the actually
realized returns. The risks are predicted well if the estimated risk $\sigma_t$ is
positively correlated with the absolute return $|y_t|$. Exhibit 7.25 (i) and (j)
contain plots of the estimated conditional standard deviations $\sigma_t$ of both
models. Panel 11 of Exhibit 7.25 shows the correlations between these forecasted
standard deviations and the series of absolute returns. The correlations are
positive (0.26 for the ARCH(5) model and 0.29 for the GARCH(2,2) model). So the
models have some success in predicting the risks in the daily movements of
the Dow-Jones index. If the forecasted standard deviation for tomorrow is
large, then tomorrow the return will, on average, be relatively large (positive
or negative).
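
The one-step-ahead risk estimate of the ARCH(5) model can be sketched in a few lines. The coefficients below are copied from Panel 3 of Exhibit 7.25; the two five-day return histories are hypothetical and serve only to show how the predicted standard deviation reacts to recent returns. The function names are illustrative, not part of any package.

```python
import math

# ARCH(5) estimates copied from Panel 3 of Exhibit 7.25.
OMEGA = 3.97e-05
ALPHAS = [0.085180, 0.127996, 0.064421, 0.124051, 0.104425]
MU = 0.000668  # estimated mean daily return (constant C in Panel 3)


def arch_sigma(last_returns, omega=OMEGA, alphas=ALPHAS, mu=MU):
    """One-step-ahead conditional standard deviation sigma_t.

    last_returns lists the most recent returns, newest first, so that
    last_returns[k - 1] is y_{t-k}.
    """
    if len(last_returns) < len(alphas):
        raise ValueError("need at least as many past returns as ARCH lags")
    variance = omega
    for alpha, y in zip(alphas, last_returns):
        variance += alpha * (y - mu) ** 2
    return math.sqrt(variance)


def unconditional_sigma(omega=OMEGA, alphas=ALPHAS):
    """Long-run standard deviation implied by the ARCH parameters."""
    return math.sqrt(omega / (1.0 - sum(alphas)))


# Hypothetical return histories (as fractions, not per cent).
calm = arch_sigma([0.001, -0.002, 0.0005, 0.001, -0.001])
turbulent = arch_sigma([-0.03, 0.02, -0.025, 0.015, -0.02])
long_run = unconditional_sigma()

print(f"calm week:      sigma = {100 * calm:.3f} per cent")
print(f"turbulent week: sigma = {100 * turbulent:.3f} per cent")
print(f"long run:       sigma = {100 * long_run:.3f} per cent")
```

The implied long-run standard deviation is close to the sample standard deviation of the returns, and a run of large recent returns pushes the predicted risk well above it.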
(i)  (j)

Panel 11: Correlations between absolute returns and conditional Std. Dev.

                   |y_t|       σ_t (ARCH5)    σ_t (GARCH22)
|y_t|              1.000000    0.255448       0.292503
σ_t (ARCH5)        0.255448    1.000000       0.782080
σ_t (GARCH22)      0.292503    0.782080       1.000000

Panel 12: Prediction-realization table for large (absolute) returns

                        ARCH(5)                GARCH(2,2)
Real          Total     σ_t < s    σ_t > s     σ_t < s    σ_t > s
|y_t| < s     1900      1392       508         1352       548
|y_t| > s      627       352       275          306       321
y_t > s        355       193       162          166       189
y_t < -s       272       159       113          140       132

Panel 13: Prediction-realization table for very large (absolute) returns

                        ARCH(5)                 GARCH(2,2)
Real          Total     σ_t < 2s   σ_t > 2s     σ_t < 2s   σ_t > 2s
y_t > 2s        64        58         6            56         8
y_t < -2s       66        64         2            59         7
Total          130       122         8           115        15

Exhibit 7.25 (Contd.)

Estimated series of conditional standard deviations of ARCH(5) model (i) and of GARCH(2,2)
model (j), correlations between absolute returns |y_t| and one-step-ahead predicted standard
deviations σ_t (Panel 11), prediction-realization table for small and large (absolute) returns
against predicted standard deviations of the ARCH(5) and GARCH(2,2) model (Panel 12),
and prediction-realization table for very large (absolute) returns (Panel 13).


(iv) Evaluation of the risk predictions

To evaluate the quality of the risk forecasts in more detail, Panel 12 of Exhibit
7.25 contains a prediction-realization table, which is constructed as follows.
The sample mean and the sample standard deviation of the returns $y_t$ are
equal to $\bar{y} = 0.0558$ per cent and $s = 0.8917$ per cent. As the standard
deviation is much larger than the mean, we take as one-day-ahead forecast
$\hat{y}_t = 0$. We call a day risky if the squared return is larger than average,
that is, if $y_t^2 > s^2$ or equivalently if $|y_t| > s$. A day is predicted to be risky
if the estimated conditional standard deviation $\sigma_t > s$. As a benchmark we
consider the forecasts generated by the model $y_t \sim \mathrm{NID}(0, s^2)$. In this model
we cannot predict tomorrow’s risk from the past. But we can randomly predict
that tomorrow is risky with probability $p = P[|y_t| > s]$ and that tomorrow is
non-risky with probability $(1 - p) = P[|y_t| < s]$. In this model
$y_t^2/s^2 \sim \chi^2(1)$, so that $p = P[y_t^2 > s^2] = P[\chi^2(1) > 1] = 0.317$. The
benchmark model has an expected hit rate of correctly predicting risky and
non-risky days equal to $p^2 + (1 - p)^2 = 0.567$. For the ARCH(5) model, 275 out
of 627 risky days and 1392 out of 1900 non-risky days are predicted correctly,
with hit rate 1667/2527 = 0.660. For the GARCH(2,2) model, 321 of the risky
days and 1352 of the non-risky days are predicted correctly, with hit
rate 1673/2527 = 0.662. Both hit rates are larger than the benchmark,
so that both models are successful in predicting the amount of risk one
day ahead.
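
The benchmark probability and the hit rates quoted above can be reproduced with a short calculation. The sketch uses the identity P[χ²(1) > c] = P[|Z| > √c] = erfc(√(c/2)) for a standard normal Z, and takes the counts from Panel 12 of Exhibit 7.25; the variable names are illustrative.

```python
import math


def chi2_1_tail(c):
    """P[chi-square(1) > c], computed as P[|Z| > sqrt(c)] for standard normal Z."""
    return math.erfc(math.sqrt(c / 2.0))


# Benchmark y_t ~ NID(0, s^2): probability that a day is risky.
p = chi2_1_tail(1.0)
benchmark_hit_rate = p ** 2 + (1.0 - p) ** 2

# Counts from Panel 12: (non-risky days predicted correctly,
#                        risky days predicted correctly).
N_DAYS = 2527
correct = {"ARCH(5)": (1392, 275), "GARCH(2,2)": (1352, 321)}
hit_rate = {name: (a + b) / N_DAYS for name, (a, b) in correct.items()}

print(f"p = {p:.3f}, benchmark hit rate = {benchmark_hit_rate:.3f}")
for name, rate in hit_rate.items():
    print(f"{name}: hit rate = {rate:.3f}")
```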

It is also of interest to distinguish between large positive returns $y_t > s$
and large negative returns $y_t < -s$, as this corresponds to different types
of risks. The ARCH(5) model correctly predicts 162 of the 355 large positive
returns (45.6 per cent) and 113 of the 272 large negative returns (41.5 per
cent). For the GARCH(2,2) model these numbers are 189 out of 355
(53.2 per cent) and 132 out of 272 (48.5 per cent) respectively. So the
GARCH(2,2) model performs somewhat better than the ARCH(5) model
in this respect.

(v) Prediction of high risks

In practice it is most relevant (and most difficult) to predict large future risks.
Panel 13 of Exhibit 7.25 contains results for very large returns
($y_t > 2s$ and $y_t < -2s$). Of the 2527 observed daily returns, 130 days involve
such large risks (64 days with $y_t > 2s$ and 66 days with $y_t < -2s$). The
ARCH(5) model correctly predicts eight out of these 130 days (6.2 per
cent), in the sense that $\sigma_t > 2s$ for such days. The GARCH(2,2) model
correctly predicts fifteen out of these 130 days (11.5 per cent). The hit
rates are small, but still better than what would be obtained by random
predictions from the model $y_t \sim \mathrm{NID}(0, s^2)$. In this model there holds
$P[|y_t| > 2s] = P[y_t^2/s^2 > 4] = P[\chi^2(1) > 4] = 0.046$. If we randomly predict
tomorrow to be very risky with probability $p = 0.046$ and not very risky
with probability $(1 - p)$, then the hit rate for very risky days would be 4.6 per
cent. Summarizing, (G)ARCH models help in evaluating the risk of tomorrow’s
returns in financial markets.
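
As a rough check on these figures, the sketch below recomputes the benchmark probability and the shares of very risky days flagged by each model (counts from Panel 13 of Exhibit 7.25). It also computes the number of very risky days one would expect among 2527 normally distributed returns, a comparison that is added here for illustration and is not made in the text.

```python
import math

# P[chi-square(1) > 4] = P[|Z| > 2] for standard normal Z.
p_very_risky = math.erfc(math.sqrt(4.0 / 2.0))

N_DAYS = 2527
VERY_RISKY_DAYS = 130
flagged = {"ARCH(5)": 8, "GARCH(2,2)": 15}   # days with sigma_t > 2s, Panel 13

share = {model: k / VERY_RISKY_DAYS for model, k in flagged.items()}
expected_under_normality = p_very_risky * N_DAYS

print(f"benchmark probability: {p_very_risky:.3f}")
for model, value in share.items():
    print(f"{model}: {100 * value:.1f} per cent of very risky days flagged")
print(f"expected very risky days under normality: {expected_under_normality:.0f}")
```

Under the normal benchmark one would expect somewhat fewer very risky days than the 130 that were observed.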


Exercises: T: 7.7b-f, 7.8; E: 7.18c, d, 7.19, 7.22, 7.23e.
7.4.5 Summary

In this section we have considered different non-linear aspects that may be
relevant in time series modelling. The modelling of non-linear aspects like
switching regimes and changing volatilities helps to understand the dynamical
structure of the process. The outcomes of estimates, diagnostic
tests, and forecasts depend on the proper modelling of such non-linear
aspects. We discussed

•  outliers  in  time  series  (additive  outliers,  innovation  outliers,  level  shifts) 
and  the  modelling  of  outliers  by  means  of  dummy  variables; 

•  changes  in  the  parameters  of  ARMA  models  (sudden  parameter  breaks 
or  more  smooth  transitions); 

•  changes  in  variance  (conditional  heteroskedasticity)  and  the  use  of 
(G)ARCH  models  in  modelling  financial  time  series,  in  constructing 
forecast  intervals,  and  in  predicting  future  risks; 

•  the  ARCH  test  for  conditional  heteroskedasticity,  which  can  also  be 
used  as  a  general  test  for  the  presence  of  non-linearities  in  observed  time 
series. 
7.5 Regression models with lags

Uses Chapters 1-4; Sections 5.5, 5.7; Sections 7.1-7.3.


7.5.1  Autoregressive  models  with  distributed  lags 

Time  series  models  with  explanatory  variables 

In the foregoing sections we considered the modelling of time series data of a
single variable $y_t$, where the value of $y_t$ is explained in terms of lagged values
of this variable and possibly lagged values of the error term. If an explanatory
variable $x_t$ is available, then $y_t$ can in addition be explained by this variable
and its lagged values. If we add $x_t$ and $r$ lagged values of $x_t$ to an ARMA($p$, $q$)
model for $y_t$, we obtain the model

$$y_t = \alpha + \sum_{k=1}^{p} \phi_k y_{t-k} + \sum_{k=0}^{r} \beta_k x_{t-k} + \sum_{k=1}^{q} \theta_k \varepsilon_{t-k} + \varepsilon_t. \qquad (7.32)$$

This is a dynamic regression model that incorporates both the autocorrelation
between successive observations of $y_t$ and the correlation of $y_t$ with the
explanatory variable $x_t$ and its lags. The model extends the regression model
of Chapters 2 and 3 (which is obtained if $\phi_k = 0$ and $\theta_k = 0$ for all $k$) and the
ARMA model of Section 7.1.4 (which is obtained if $\beta_k = 0$ for all $k$).

Autoregressive  model  with  distributed  lags 

Of particular interest is the model without MA component, that is, with
$q = 0$. This is called the autoregressive model with distributed lags, also
denoted as ADL($p$, $r$). In this model, the effect of the explanatory variable
$x_t$ on the dependent variable $y_t$ is distributed over time. The model can be
written as

$$\phi(L) y_t = \alpha + \beta(L) x_t + \varepsilon_t,$$

where $\phi(z) = 1 - \sum_{k=1}^{p} \phi_k z^k$ and $\beta(z) = \sum_{k=0}^{r} \beta_k z^k$. We assume
that the AR polynomial is stationary, that is, that $\phi(z) = 0$ has all its solutions
outside the unit circle. A change in $x_t$ has an effect on $y_t$ that is distributed over
time. The instantaneous or short-run multiplier is $\beta_0$. The long-run multiplier
(that we denote by $\lambda$) measures the long-run effect on $E[y_t]$ of a permanent change
in the level of $x_t$. If the level of $x_t$ increases by one unit for all times, then the
mean of $y_t$ increases by $\lambda = \sum_{k=0}^{r} \beta_k / (1 - \sum_{k=1}^{p} \phi_k) = \beta(1)/\phi(1)$ units.
Note that $\phi(1) \neq 0$, as it is assumed that the AR polynomial has all its roots
outside the unit circle.
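
To make the distinction between the short-run and the long-run multiplier concrete, the following sketch traces the effect on the mean of y_t of a permanent unit increase in x_t. The coefficient values are hypothetical and the function names are illustrative; the AR part is assumed to be stationary.

```python
def long_run_multiplier(phi, beta):
    """lambda = beta(1) / phi(1) for an ADL(p, r) model.

    phi  = [phi_1, ..., phi_p]    (AR coefficients)
    beta = [beta_0, ..., beta_r]  (coefficients of x_t, ..., x_{t-r})
    """
    phi_at_one = 1.0 - sum(phi)
    if phi_at_one == 0.0:
        raise ValueError("phi(1) = 0: no finite long-run multiplier")
    return sum(beta) / phi_at_one


def step_response(phi, beta, horizon):
    """Change in E[y_t], t = 0, 1, ..., after x rises by one unit from t = 0 on."""
    effects = []
    for t in range(horizon):
        direct = sum(beta[: t + 1])  # lags of x that already carry the increase
        carried = sum(
            coef * effects[t - k]
            for k, coef in enumerate(phi, start=1)
            if t - k >= 0
        )
        effects.append(direct + carried)
    return effects


phi = [0.5]          # hypothetical ADL(1,1) coefficients
beta = [0.3, 0.1]
path = step_response(phi, beta, 40)
lam = long_run_multiplier(phi, beta)

print("short-run multiplier:", path[0])
print("effect after 40 periods:", round(path[-1], 6))
print("long-run multiplier:", lam)
```

The first element of the path equals β_0, and the path converges to β(1)/φ(1) as the horizon grows.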

The model (7.32) is easily extended to the case of more than one explanatory
variable, but for simplicity we will restrict ourselves to a single explanatory
variable. A special case of the ADL(1,1) model was discussed in
Section 5.5.4 (p. 369), namely, the regression model with AR(1) errors.
This corresponds to the ADL(1,1) model with the parameter restriction
$\beta_0 \phi_1 + \beta_1 = 0$, see (5.52) in Section 5.5.4. This model is sometimes used to
reduce some of the serial correlation in (static) regression models, but in
practice the residuals of this model often still contain much serial correlation.
In this section we discuss some other special cases of the model (7.32) that
have an interesting economic interpretation.

Partial  adjustment 

One of the possible reasons for dynamic effects in business and economic
processes is that economic subjects adjust themselves only gradually to
changing conditions. For instance, consumption habits are adjusted only
gradually to changes in income levels and prices. Similarly, the sales of a
new brand of a product may reach its equilibrium level only after a certain
period of time, when the brand has established its position in the market. For
given value of the independent variable $x_t$, let the corresponding equilibrium
level of $y_t$ be given by $y_t^* = \gamma + \delta x_t$. Suppose that the actual level of $y_t$ is only
partially adjusted in the sense that

$$y_t = y_{t-1} + \lambda (y_t^* - y_{t-1}) + \varepsilon_t,$$

where $0 \le \lambda \le 1$. For instance, if the current level $y_{t-1}$ is smaller than the
equilibrium value $y_t^*$, then the series tends to be adjusted upwards. The
adjustment is complete if $\lambda = 1$, it is partial if $0 < \lambda < 1$, and there is no
adjustment if $\lambda = 0$. Substituting $y_t^* = \gamma + \delta x_t$ in the above equation gives the
model

$$y_t = \alpha + \phi y_{t-1} + \beta x_t + \varepsilon_t,$$

where $\alpha = \lambda\gamma$, $\phi = (1 - \lambda)$, and $\beta = \lambda\delta$. This is an ADL(1,0) model. It is called
the partial adjustment model.
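
A small numerical sketch may help to see how partial adjustment works. It maps the structural parameters (γ, δ, λ) to the ADL(1,0) coefficients and follows the noise-free path of y_t for a constant value of x_t. All numbers are hypothetical and the function names are illustrative.

```python
def partial_adjustment_params(gamma, delta, lam):
    """Map (gamma, delta, lambda) to the ADL(1,0) coefficients (alpha, phi, beta)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lambda must lie in [0, 1]")
    return lam * gamma, 1.0 - lam, lam * delta


def adjustment_path(y0, x, gamma, delta, lam, periods):
    """Noise-free path of y_t for a constant x, starting from y0."""
    alpha, phi, beta = partial_adjustment_params(gamma, delta, lam)
    path = [y0]
    for _ in range(periods):
        path.append(alpha + phi * path[-1] + beta * x)
    return path


gamma, delta, lam = 2.0, 0.5, 0.25   # hypothetical values
x = 10.0
target = gamma + delta * x           # equilibrium level y* for this x
path = adjustment_path(0.0, x, gamma, delta, lam, 60)

for t in (0, 1, 2, 5, 20, 60):
    print(f"t = {t:2d}: y = {path[t]:.4f} (equilibrium {target})")
```

In each period a fraction λ of the remaining gap to the equilibrium level is closed, so the gap shrinks by the factor (1 - λ).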

Adaptive  expectations 

It may also be that economic subjects decide on the level of $y_t$ on the basis of
expected values of $x_{t+1}$, denoted by $x_{t+1}^*$. For instance, expenses on durable
goods may depend on expected income, and the production of a new product
may be based on expected sales. The model is given by

$$y_t = \gamma + \delta x_{t+1}^* + \varepsilon_t.$$

The expectation of the future value of $x_{t+1}$ can be adjusted according to the
actual values of $x_t$. The adaptive expectations model postulates that

$$x_{t+1}^* = x_t^* + \lambda (x_t - x_t^*),$$

where $0 \le \lambda \le 1$. For instance, if the current value $x_t$ is larger than expected
(so that $x_t > x_t^*$), then this leads to an upward correction in future expectations.
The adaptation is complete if $\lambda = 1$ (as $x_{t+1}^* = x_t$ in this case), it is
partial if $0 < \lambda < 1$, and expectations are not adapted if $\lambda = 0$ (as $x_{t+1}^* = x_t^*$
in this case). In this model the expectations are obtained by exponential
smoothing (with smoothing factor $1 - \theta = \lambda$), so that this gives the optimal
forecasts if $x_t$ is an ARIMA(0,1,1) process (see Section 7.3.2). The unobserved
variable $x_{t+1}^*$ in the above equation for $y_t$ can be eliminated by
using $y_t - (1 - \lambda) y_{t-1} = \lambda\gamma + \delta (x_{t+1}^* - (1 - \lambda) x_t^*) + \varepsilon_t - (1 - \lambda) \varepsilon_{t-1}$, where
$x_{t+1}^* - (1 - \lambda) x_t^* = \lambda x_t$, so that

$$y_t = \alpha + \phi y_{t-1} + \beta x_t + \varepsilon_t - \phi \varepsilon_{t-1},$$

where $\alpha = \lambda\gamma$, $\phi = (1 - \lambda)$, and $\beta = \lambda\delta$. This model is of the form (7.32) with
orders $p = 1$, $r = 0$, and $q = 1$. The only difference with the partial adjustment
model is that the adaptive expectations model contains an additional
MA(1) term.
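
The elimination of the unobserved expectations can be checked numerically. The sketch below simulates the structural model with hypothetical parameter values and an assumed starting expectation of zero, and then verifies that the generated series satisfies the reduced form with the MA(1) error term.

```python
import random

random.seed(0)
gamma, delta, lam = 1.0, 2.0, 0.4     # hypothetical parameter values
n = 200

x = [random.gauss(0.0, 1.0) for _ in range(n)]
eps = [random.gauss(0.0, 0.5) for _ in range(n)]

# Adaptive expectations: x_star[t + 1] = x_star[t] + lam * (x[t] - x_star[t]).
x_star = [0.0]                         # assumed starting expectation x*_0
for t in range(n):
    x_star.append(x_star[t] + lam * (x[t] - x_star[t]))

# Structural equation: y_t = gamma + delta * x*_{t+1} + eps_t.
y = [gamma + delta * x_star[t + 1] + eps[t] for t in range(n)]

# Reduced form: y_t = alpha + phi*y_{t-1} + beta*x_t + eps_t - phi*eps_{t-1}.
alpha, phi, beta = lam * gamma, 1.0 - lam, lam * delta
max_gap = max(
    abs(y[t] - (alpha + phi * y[t - 1] + beta * x[t] + eps[t] - phi * eps[t - 1]))
    for t in range(1, n)
)
print("largest difference between the two forms:", max_gap)
```

The difference is zero up to rounding error, so the two formulations describe the same series.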

Error  correction  model 

The ADL model can be rewritten in terms of changes of the variables, that
is, in terms of the first differences $\Delta y_t = y_t - y_{t-1}$ and $\Delta x_t = x_t - x_{t-1}$. We
consider this reformulation first for the ADL(1,1) model, that is, the model
(7.32) with $p = 1$, $r = 1$, and $q = 0$. By subtracting $y_{t-1}$ from both sides of
the equation (7.32), we can write this model as
$\Delta y_t = \alpha + (\phi - 1) y_{t-1} + \beta_0 \Delta x_t + (\beta_0 + \beta_1) x_{t-1} + \varepsilon_t$, or equivalently

$$\Delta y_t = \beta_0 \Delta x_t - (1 - \phi)(y_{t-1} - \lambda x_{t-1} - \delta) + \varepsilon_t, \qquad (7.33)$$

with $\delta = \alpha/(1 - \phi)$ and where $\lambda = (\beta_0 + \beta_1)/(1 - \phi)$ is the long-run multiplier.
Note that the equilibrium relation $y = \delta + \lambda x$ is obtained if in the
ADL(1,1) model $\varepsilon_t = 0$ and the values of $y_t = y_{t-1} = y$ and $x_t = x_{t-1} = x$
are fixed, as in this case $y = \alpha + \phi y + \beta_0 x + \beta_1 x$, so that
$y = \alpha/(1 - \phi) + (\beta_0 + \beta_1) x/(1 - \phi) = \delta + \lambda x$. The ADL(1,1) model written in
the form (7.33) is called an error correction model (ECM). This shows that
there are two systematic effects on the changes $\Delta y_t$ of the dependent variable.
The first effect is the instantaneous multiplier effect $\beta_0 \Delta x_t$ that is due to
changes in the explanatory variable. The second effect concerns deviations from
the long-run equilibrium relation $y_{t-1} = \delta + \lambda x_{t-1}$. For instance, suppose that
$y_{t-1} > \delta + \lambda x_{t-1}$, so that the value of $y_{t-1}$ is above the long-run equilibrium
value corresponding to $x_{t-1}$. The stationarity assumption implies that $\phi < 1$, so
that $(1 - \phi) > 0$, and hence this provides a negative effect on $\Delta y_t$, that is, $y_t$
will tend to move downwards in the direction of equilibrium. Therefore
the term $-(1 - \phi)(y_{t-1} - \lambda x_{t-1} - \delta)$ takes care that deviations from equilibrium
(the ‘errors’) are corrected. If $\phi = 1$, that is, if the series $y_t$ has a unit
root, then the error correction term drops out from (7.33). In this case
there exists no long-run equilibrium for $y_t$ and the long-run multiplier is
infinitely large.
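
That the error correction form (7.33) is only a rewriting of the ADL(1,1) model can be verified directly. In the sketch below both functions compute the change in y_t from the same inputs; the parameter values and the data point are hypothetical.

```python
def adl11_change(y_prev, x_now, x_prev, alpha, phi, beta0, beta1, eps=0.0):
    """Change in y_t computed from the ADL(1,1) equation."""
    y_now = alpha + phi * y_prev + beta0 * x_now + beta1 * x_prev + eps
    return y_now - y_prev


def ecm_change(y_prev, x_now, x_prev, alpha, phi, beta0, beta1, eps=0.0):
    """Change in y_t computed from the error correction form (7.33)."""
    delta = alpha / (1.0 - phi)
    lam = (beta0 + beta1) / (1.0 - phi)      # long-run multiplier
    error = y_prev - lam * x_prev - delta    # deviation from equilibrium
    return beta0 * (x_now - x_prev) - (1.0 - phi) * error + eps


params = dict(alpha=0.2, phi=0.7, beta0=0.3, beta1=0.15)   # hypothetical
change_adl = adl11_change(3.0, 1.5, 1.0, **params)
change_ecm = ecm_change(3.0, 1.5, 1.0, **params)

print("ADL(1,1) form:", round(change_adl, 10))
print("ECM form:     ", round(change_ecm, 10))
```

In this example the previous value of y lies above its equilibrium level, so the error correction term pulls y downwards even though x increases.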

It is left as an exercise (see Exercise 7.9) to show that for higher order lags
the ADL model can also be written in error correction form. That is, the
ADL($p$, $r$) model with stationary AR polynomial $\phi(z)$ can be written as

$$\Delta y_t = \beta_0 \Delta x_t - \phi(1)(y_{t-1} - \lambda x_{t-1} - \delta) + \sum_{k=1}^{p-1} \phi_k^* \Delta y_{t-k} + \sum_{k=1}^{r-1} \beta_k^* \Delta x_{t-k} + \varepsilon_t, \qquad (7.34)$$

with $\phi(1) = 1 - \sum_{k=1}^{p} \phi_k > 0$, $\delta = \alpha/\phi(1)$, and where $\lambda = \beta(1)/\phi(1)$ is the
long-run multiplier. As before, the relation $y = \delta + \lambda x$ corresponds to
the long-run equilibrium relation. Since $\phi(1) > 0$, deviations from this
equilibrium are again corrected in this model.

Exercises:  T:  7.9a,  b. 


7.5.2  Estimation,  testing,  and  forecasting 

Model  assumptions 

To estimate the parameters of the ADL model $\phi(L) y_t = \alpha + \beta(L) x_t + \varepsilon_t$, we make
the following assumptions. First, the disturbance terms $\varepsilon_t$ satisfy all the usual
assumptions, that is, they have mean zero and constant variance $\sigma^2$, and
they are mutually independent and jointly normally distributed. If this is not
the case, then the model should be adjusted (by including more lagged terms
of $y_t$ and $x_t$ or by including MA terms as in (7.32)). Second, the explanatory
variables are exogenous, in the sense that all current and past values
$\{x_{t-k}, k \ge 0\}$ are uncorrelated with the error term $\varepsilon_t$. If this is not the case,
then one should instead specify a multiple equation model, for instance, a
vector autoregressive model, as will be discussed in Section 7.6. Finally, the
AR polynomial $\phi(z)$ is stationary, that is, it has all its roots outside the unit
circle, and the process $x_t$ is stationary. These conditions imply that $y_t$ is a
stationary process. If the AR polynomial contains unit roots, then $y_t$ contains
stochastic trends and ADL models are not appropriate. The modelling of
series with stochastic trends is discussed in Section 7.6.3.


Estimation  of  ADL  models 

The ADL($p$, $r$) model, that is, (7.32) without MA terms, is a regression
model with stochastic regressors. Under the above model assumptions, the
parameters can be estimated by least squares and the OLS estimators have
the standard statistical properties discussed in Section 4.1.4 (p. 197). That is,
under the above assumptions OLS is consistent and the conventional t- and
F-tests are valid asymptotically. Because the error terms are assumed to be
normally distributed, OLS is equivalent to ML and hence it is asymptotically
efficient. To analyse this in more detail, we consider for simplicity the
ADL(1,1) model. Let $Z_t = (1, y_{t-1}, x_t, x_{t-1})'$ be the vector of regressors in
this model. Then the stability and orthogonality conditions for stochastic
regressors formulated in Section 4.1 can be formulated as

$$\mathrm{plim}\, \frac{1}{n} \sum_{t=2}^{n} Z_t \varepsilon_t = 0, \qquad \mathrm{plim}\, \frac{1}{n} \sum_{t=2}^{n} Z_t Z_t' = Q_{zz},$$

where $Q_{zz}$ is an invertible matrix. The first (orthogonality) condition is
satisfied because the explanatory variable $x_t$ and its lagged values are assumed
to be exogenous. The second (stability) condition requires the variables
to have finite variances and covariances, and this is satisfied because the
processes $y_t$ and $x_t$ are assumed to be stationary. It is sufficient that the
process $x_t$ is stationary and that $-1 < \phi < 1$. If the process $y_t$ has a unit
root, then the stability condition is not satisfied. This is because in this case $y_t$
has infinite variance for $t \to \infty$ (see Section 7.3.1). Then the OLS estimators
do not have the standard properties discussed in Section 4.1 (see also the
discussion in Section 7.3.3 on unit root tests).
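
The consistency of OLS in a stationary ADL(1,1) model can be illustrated by simulation. The sketch below uses only the Python standard library and solves the normal equations directly; the true parameter values, the AR(1) process chosen for x_t, and the sample size are all hypothetical choices.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def ols(rows, targets):
    """OLS coefficients from the normal equations (Z'Z) b = Z'y."""
    k = len(rows[0])
    ztz = [[sum(z[i] * z[j] for z in rows) for j in range(k)] for i in range(k)]
    zty = [sum(z[i] * y for z, y in zip(rows, targets)) for i in range(k)]
    return solve(ztz, zty)


random.seed(1)
alpha, phi, beta0, beta1 = 0.5, 0.6, 0.3, 0.2   # hypothetical true values
n = 5000

x = [0.0]
for _ in range(n):                               # stationary AR(1) regressor
    x.append(0.5 * x[-1] + random.gauss(0.0, 1.0))

y = [0.0]
for t in range(1, n + 1):
    shock = random.gauss(0.0, 1.0)
    y.append(alpha + phi * y[-1] + beta0 * x[t] + beta1 * x[t - 1] + shock)

# Regressor vector Z_t = (1, y_{t-1}, x_t, x_{t-1}).
rows = [[1.0, y[t - 1], x[t], x[t - 1]] for t in range(1, n + 1)]
estimates = ols(rows, y[1:])

for name, est in zip(("alpha", "phi", "beta0", "beta1"), estimates):
    print(f"{name}: {est:.3f}")
```

With a sample of this size the estimates should lie close to the values used to generate the data.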


Estimation  of  models  with  MA  terms 

If the model (7.32) contains moving average terms, then least squares is
no longer consistent. For instance, in the case of single lags (so that
$p = q = r = 1$) and with exogenous regressors $x_t$ and $x_{t-1}$, the covariance
between the regressor $y_{t-1}$ and the error term $\varepsilon_t + \theta \varepsilon_{t-1}$ in (7.32) is
equal to

$$\mathrm{cov}(\alpha + \phi y_{t-2} + \beta_0 x_{t-1} + \beta_1 x_{t-2} + \varepsilon_{t-1} + \theta \varepsilon_{t-2},\; \varepsilon_t + \theta \varepsilon_{t-1}) = \theta \sigma^2 \neq 0.$$

Models with MA terms should be estimated by maximum likelihood. The
standard asymptotic theory for ML estimators applies, provided that the
processes $x_t$ and $y_t$ are stationary.
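
The source of the inconsistency can be made visible by simulation. The sketch below generates a series with an AR(1) part and an MA(1) error (the x terms and the constant are left out for brevity, which is a simplifying assumption) and estimates the covariance between the lagged dependent variable and the composite error term. The parameter values are hypothetical.

```python
import random

random.seed(2)
phi, theta, sigma = 0.5, 0.6, 1.0     # hypothetical values
n = 100_000

eps_prev = 0.0
y_prev = 0.0
products = []
for _ in range(n):
    eps = random.gauss(0.0, sigma)
    u = eps + theta * eps_prev         # composite error eps_t + theta*eps_{t-1}
    products.append(y_prev * u)        # y_{t-1} times the composite error
    y_prev = phi * y_prev + u          # y_t
    eps_prev = eps

sample_cov = sum(products) / n         # both series have mean zero
print(f"sample covariance: {sample_cov:.3f}")
print(f"theta * sigma^2:   {theta * sigma ** 2:.3f}")
```

The sample covariance is close to θσ² and not to zero, so the orthogonality condition fails for the regressor y_{t-1}.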

Diagnostic  tests 

If the processes $x_t$ and $y_t$ are stationary, then the diagnostic tests described
before (in Chapter 5 for regression models and in Section 7.2.4 for stationary
time series) can all be applied to models of the form (7.32). For instance, the
lag lengths $p$, $q$, and $r$ can be chosen by means of t-tests and F-tests on the
significance of additional lagged terms, and also by means of the selection
criteria AIC and SIC. The Breusch-Godfrey LM-test on serial correlation and
the ARCH LM-test on heteroskedasticity can be performed as before on the
residuals of the estimated model.
Forecasting

To use the model (7.32) in one-period-ahead forecasting we need to know or
to estimate the value of $x_{n+1}$ and we have to estimate the value of $\varepsilon_n$. The best
we can do is to replace the unknown parameters by their ML estimates and to
take the resulting residual series as an estimate of the error terms $\varepsilon_t$. For
instance, in the model with single lags ($p = q = r = 1$), the forecast is then
given by $\hat{y}_{n+1} = \hat{\alpha} + \hat{\phi} y_n + \hat{\beta}_0 \hat{x}_{n+1} + \hat{\beta}_1 x_n + \hat{\theta} e_n$, where $e_n = y_n - \hat{y}_n$. The
required forecast $\hat{x}_{n+1}$ of the explanatory variable can be obtained, for
instance, from a (univariate) time series model for $x_t$ as discussed in Section
7.1.6. For multi-period-ahead forecasts, the future values of $y_t$ that appear as
regressors on the right-hand side in (7.32) are themselves forecasted. Combined
with the additional uncertainty in the forecasts of the explanatory
variables $x_t$, this implies that the forecast intervals will be wider than in the
case of known, non-stochastic regressors considered in Sections 2.4.1 (p. 106)
and 3.4.3 (p. 171).
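
A minimal sketch of the one-step-ahead forecast for the model with single lags is given below. It includes the constant term α of (7.32), treats the parameter values as if they were estimates, and starts the residual recursion from e_0 = 0; the start-up value, the data, and the parameter values are all assumptions made for illustration.

```python
def one_step_forecast(y, x, x_next, alpha, phi, beta0, beta1, theta):
    """Forecast of y_{n+1} for model (7.32) with p = q = r = 1.

    y and x hold the observations 0, ..., n; x_next is a forecast (or a
    known value) of x_{n+1}. Residuals are built up recursively from
    e_0 = 0, which is an assumed start-up convention.
    """
    e = 0.0
    for t in range(1, len(y)):
        fitted = alpha + phi * y[t - 1] + beta0 * x[t] + beta1 * x[t - 1] + theta * e
        e = y[t] - fitted                     # e_t = y_t minus its fitted value
    return alpha + phi * y[-1] + beta0 * x_next + beta1 * x[-1] + theta * e


# Hypothetical data and parameter values.
y_obs = [1.0, 1.2, 0.9]
x_obs = [0.5, 0.7, 0.4]
forecast = one_step_forecast(
    y_obs, x_obs, x_next=0.6,
    alpha=0.1, phi=0.5, beta0=0.3, beta1=0.1, theta=0.4,
)
print(f"forecast of y_(n+1): {forecast:.4f}")
```

For multi-period-ahead forecasts the same recursion can be continued, with future values of y_t replaced by their forecasts and future error terms replaced by zero.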

Example  7.22:  Interest  and  Bond  Rates  (continued) 

To illustrate the use of the regression model with lags, we consider
the relation between the monthly changes $y_t$ in the AAA bond rate and the
monthly changes $x_t$ in the three-month Treasury Bill rate. These data were
also used in Chapter 5; see Example 5.11 (p. 322) for further motivation. The
data cover the period 1950-99, leading to $n = 600$ monthly observations.
We will subsequently discuss (i) the simple regression model, (ii) an ADL
model, (iii) an ADL model with GARCH error terms, (iv) interpretation of
the GARCH effects, and (v) some important remarks on further relevant data
properties.

(i)  The  simple  regression  model  with  AR(1)  errors 

In Chapter 5 we considered the simple regression model $y_t = \alpha + \beta x_t + \varepsilon_t$ for
these data. This model was analysed in Examples 5.19 (p. 354), 5.21 (p. 360),
5.22 (p. 365), and 5.24 (p. 370). The conclusion of this analysis was that the
residuals of the simple regression model show significant serial correlation
and that an AR(1) model for the error terms $\varepsilon_t$ is too limited to remove this
serial correlation. Therefore other models are needed that reflect the dynamical
relations between the two variables. For later comparison the estimation
results of the simple regression model (in Panel 1 of Exhibit 5.30 (p. 360)) are
given once more in Panel 1 of Exhibit 7.26.

(ii)  ADL  model 

To model the dynamical relations between the bond rate $y_t$ and the interest
rate $x_t$ we now estimate ADL models. We start with an ADL($p$, $r$) model
that contains enough lags so that the residuals are not serially correlated
anymore. It turns out that it is sufficient to include $p = r = 6$ lags. Next we
try to reduce the number of lags, by considering the significance of lagged
terms, by comparing SIC values, and by checking for serial correlation
in ADL models with lower orders for $p$ and $r$. This leads to the ADL model
with $p = 3$ lags of $y_t$ and $r = 4$ lags of $x_t$. The estimation results of the
ADL(3,4) model are given in Panel 2 of Exhibit 7.26. The correlations
of the residuals of this model are small, with a largest correlation of 0.078
at lag 6 (see Panel 3). The Breusch-Godfrey test on serial correlation gives
P-values of $P = 0.07$ (with one lag included) and $P = 0.03$ (with twelve lags
included). So, at the 1 per cent significance level (which is a reasonable choice
for $n = 600$ observations), there is no significant serial correlation. We
conclude that the dynamic correlations between the changes in the AAA
bond rate and the three-month Treasury Bill rate are captured well by this
model. The short-run multiplier (the coefficient of $x_t$) in the ADL(3,4) model
is 0.24. This is quite close to the estimated value of 0.27 in the simple
regression model.

It is of interest to test whether changes in the interest rate in the long run
lead to equal changes in the bond rate, that is, to test whether the long-run
multiplier is equal to 1. This corresponds to the parameter restriction
$\lambda = \sum_{k=0}^{4} \beta_k / (1 - \sum_{k=1}^{3} \phi_k) = 1$, or, equivalently, to
$\sum_{k=0}^{4} \beta_k + \sum_{k=1}^{3} \phi_k = 1$.
Panel 4 of Exhibit 7.26 shows the outcomes of the F-test for this hypothesis.
The hypothesis is clearly rejected ($F = 119$ with $P = 0.00$).
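
The point estimate of the long-run multiplier is not reported in the text, but it follows from the coefficients in Panel 2 of Exhibit 7.26. The short calculation below is a sketch of that arithmetic; it gives the point estimate only and says nothing about its standard error, for which the F-test of Panel 4 is the relevant evidence.

```python
# OLS estimates of the ADL(3,4) model, copied from Panel 2 of Exhibit 7.26.
phi_hat = [0.376847, -0.229060, 0.087534]                        # DAAA(-1), ..., DAAA(-3)
beta_hat = [0.240321, -0.084951, 0.080341, -0.061728, 0.055952]  # DUS3MT, ..., DUS3MT(-4)

short_run = beta_hat[0]
long_run = sum(beta_hat) / (1.0 - sum(phi_hat))

# The restriction lambda = 1 is equivalent to sum(beta) + sum(phi) = 1.
restriction_gap = sum(beta_hat) + sum(phi_hat) - 1.0

print(f"short-run multiplier: {short_run:.3f}")
print(f"long-run multiplier:  {long_run:.3f}")
print(f"sum(beta) + sum(phi) - 1 = {restriction_gap:.3f}")
```

The estimated long-run multiplier is far below one, which is in line with the clear rejection of the hypothesis.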
(iii) ADL model with GARCH error terms

The residuals of the ADL model are not serially correlated, but they exhibit
clustered volatility. This is clear from the time series plot of the residuals in
Exhibit 7.26 (e). The ARCH test (with four lags) on the residuals is very
significant (see Panel 6). Therefore we now estimate the ADL(3,4) model
with GARCH(1,1) error terms, with results in Panel 7 of Exhibit 7.26. The
residuals of the ADL(3,4)-GARCH(1,1) model do not contain clustered
volatility anymore (the ARCH LM-test with four lags gives $P = 0.087$, see
Panel 8). Further note that the lagged values of $x_t$ have become jointly
insignificant in this model (the corresponding F-test has $P = 0.58$, see
Panel 9). This means that the dynamic effects, which were attributed in the
ADL model to past changes in the Treasury Bill rate, are better described in
terms of clustered volatility of the AAA bond rate changes.

(iv)  Interpretation  of  the  GARCH  effects 

Exhibit 7.26 (j) shows the time series plot of the predicted standard deviations
obtained from the GARCH(1,1) model. The conditional standard
deviation shows much variation over time. In particular there seems to be a
break in the volatility around 1980. The GARCH(1,1) model captures the
variations in the volatility of the series much better than the models discussed
in Example 5.18 (p. 350-2) for these heteroskedastic data. Exhibit 7.26 (k)
shows the scatter plot of the absolute changes in the AAA bond rate against
the predicted standard deviation. These variables are positively correlated.
On average, if the predicted standard deviation is relatively large (small) then
the (absolute) change in the AAA bond rate is also relatively large (small). So
this model is helpful in predicting risky months, that is, months where the
AAA bond rates contain much uncertainty.

(v)  Remarks  on  further  relevant  data  properties 

We will return later to these data to answer two remaining questions. The
first is why the model is formulated for the changes and not for the levels of
the two rates. The reason is that, in estimating the ADL model, it is assumed
that the series are stationary. We will analyse this issue further in Example
7.25 in the next section and especially in Example 7.27 in Section 7.6.3. It
will turn out that it is better to model the levels of the two time series, and not
the series of first differences. The second question is related to the possible
endogeneity of the Treasury Bill rate. Indeed, in Example 5.33 (p. 414-16)
we concluded that $x_t$ is endogenous. This means that a fundamental assumption
of the ADL model is violated, so that the results of the ADL model may
be misleading. Models that account for the joint endogeneity of both time
series are considered in Examples 7.26 and 7.32.
Panel 1: Dependent Variable: DAAA
Method: Least Squares
Sample: 1950:01 1999:12; Included observations: 600

Variable      Coefficient    Std. Error    t-Statistic    Prob.
C             0.006393       0.006982      0.915697       0.3602
DUS3MT        0.274585       0.014641      18.75442       0.0000

R-squared             0.370346     Mean dependent var       0.008283
S.E. of regression    0.171002     Akaike info criterion    -0.690952
Sum squared resid     17.48658     Schwarz criterion        -0.676296
Log likelihood        209.2857     F-statistic              351.7282
Durbin-Watson stat    1.446887     Prob(F-statistic)        0.000000

Panel 2: Dependent Variable: DAAA
Method: Least Squares
Sample: 1950:01 1999:12; Included observations: 600

Variable      Coefficient   Std. Error   t-Statistic   Prob.
C              0.004909     0.006518      0.753160     0.4517
DAAA(-1)       0.376847     0.042230      8.923663     0.0000
DAAA(-2)      -0.229060     0.044907     -5.100809     0.0000
DAAA(-3)       0.087534     0.042553      2.057033     0.0401
DUS3MT         0.240321     0.015169     15.84308      0.0000
DUS3MT(-1)    -0.084951     0.018208     -4.665701     0.0000
DUS3MT(-2)     0.080341     0.018521      4.337855     0.0000
DUS3MT(-3)    -0.061728     0.018403     -3.354288     0.0008
DUS3MT(-4)     0.055952     0.014510      3.856226     0.0001

R-squared             0.459580    Mean dependent var       0.008283
S.E. of regression    0.159358    Akaike info criterion   -0.820443
Sum squared resid    15.00839     Schwarz criterion       -0.754489
Log likelihood      255.1329      F-statistic             62.82430
Durbin-Watson stat    2.022425    Prob(F-statistic)        0.000000

Panel 3: correlograms of residuals of ADL(0,0) and ADL(3,4)
Sample: 1950:01 1999:12; Included observations: 600

            ADL(0,0)                     ADL(3,4)
Lag     SACF    Q-Stat    Prob      SACF    Q-Stat    Prob
  1     0.276   45.932    0.000    -0.013   0.1064    0.744
  2    -0.076   49.398    0.000     0.014   0.2288    0.892
  3     0.008   49.441    0.000    -0.030   0.7695    0.857
  4     0.034   50.126    0.000     0.021   1.0323    0.905
  5     0.055   51.939    0.000     0.011   1.1061    0.954
  6     0.101   58.189    0.000     0.078   4.8142    0.568
  7     0.035   58.934    0.000    -0.013   4.9128    0.671
  8     0.049   60.412    0.000     0.059   7.0251    0.534
  9     0.044   61.610    0.000     0.032   7.6697    0.568
 10     0.008   61.646    0.000     0.012   7.7540    0.653
 11     0.032   62.289    0.000     0.045   8.9743    0.624
 12    -0.062   64.624    0.000    -0.037   9.8041    0.633

Exhibit  7.26  Interest  and  Bond  Rates  (Example  7.22) 

Simple  regression  of  DAAA  on  DUS3MT  (ADL(0,0)  model,  Panel  1)  and  ADL(3,4)  model 
(Panel  2)  with  correlograms  of  residuals  (Panel  3). 
(d) Panel 4: Wald test on long-run multiplier
Null Hypothesis: C(2)+C(3)+C(4)+C(5)+C(6)+C(7)+C(8)+C(9)=1
F-statistic   119.0438   Probability 0.000000
Chi-square    119.0438   Probability 0.000000

(e) [Figure: time plot of the DAAA residuals]

(f) Panel 6: ARCH Test on residuals of the ADL(3,4) model
4 lags of squared residuals included in the test equation
F-statistic                     28.63848   Probability 0.000000
Obs*R-squared = 596*0.162360    96.76672   Probability 0.000000


Panel 7: Dependent Variable: DAAA
Method: ML - ARCH
Sample: 1950:01 1999:12; Included observations: 600
Convergence achieved after 33 iterations

Variable      Coefficient   Std. Error   z-Statistic   Prob.
C              0.001375     0.002852      0.482058     0.6298
DAAA(-1)       0.395604     0.048109      8.223042     0.0000
DAAA(-2)      -0.194726     0.045298     -4.298749     0.0000
DAAA(-3)       0.054381     0.042720      1.272948     0.2030
DUS3MT         0.144789     0.012544     11.54285      0.0000
DUS3MT(-1)    -0.002217     0.016460     -0.134683     0.8929
DUS3MT(-2)     0.020133     0.016000      1.258328     0.2083
DUS3MT(-3)    -0.020644     0.015003     -1.375931     0.1688
DUS3MT(-4)     0.000278     0.012548      0.022179     0.9823

Variance Equation
C              9.97E-05     2.09E-05      4.776163     0.0000
ARCH(1)        0.183616     0.028026      6.551611     0.0000
GARCH(1)       0.836600     0.021622     38.69214      0.0000

R-squared             0.389533    Mean dependent var       0.008283
S.E. of regression    0.169803    Akaike info criterion   -1.560654
Sum squared resid    16.95374     Schwarz criterion       -1.472716
Log likelihood      480.1962      F-statistic             34.10875
Durbin-Watson stat    1.991229    Prob(F-statistic)        0.000000

Exhibit  7.26  (Contd.) 

Wald test on unit long-run multiplier (Panel 4), plot of residuals of the ADL(3,4) model (e),
ARCH(4) test on these residuals (Panel 6), and ADL(3,4) model with GARCH(1,1) disturbances
(Panel 7).
(h) Panel 8: ARCH Test on residuals of the ADL(3,4) - GARCH(1,1) model
4 lags of squared residuals included in the test equation
F-statistic                     2.045773   Probability 0.086559
Obs*R-squared = 596*0.013657    8.139619   Probability 0.086596

Panel 9: Wald Test on significance of lagged terms of DUS3MT
Wald Test: C(6)=0, C(7)=0, C(8)=0, C(9)=0
F-statistic   0.711700   Probability 0.584132
Chi-square    2.846801   Probability 0.583781

(j) [Figure: time plot of CONDSTDEV]   (k) [Figure: scatter diagram of ABSDAAA against CONDSTDEV]


Exhibit 7.26 (Contd.)

ARCH(4) test on residuals of ADL(3,4)-GARCH(1,1) model (Panel 8), F-test on significance of
the four lagged terms of DUS3MT in this model (Panel 9), plot of one-month-ahead predicted
standard deviations of GARCH(1,1) (denoted by CONDSTDEV (j)), and scatter diagram of
absolute monthly changes in the AAA bond rate (denoted by ABSDAAA) against the predicted
standard deviation ((k), with regression line).


Exercises:  E:  7.21a,  d-f. 


7.5.3  Regression  of  variables  with  trends 

Danger  of  spurious  regressions 

In  the  foregoing  section  we  described  the  estimation  and  testing  of  regression 
models  for  time  series  data.  The  outcomes  can  be  evaluated  in  the  usual  way, 
provided  that  the  sample  is  large  enough,  the  explanatory  variables  are 
exogenous, and both the dependent and the explanatory variables are stationary.
If the variables contain stochastic trends, then the application of standard
regression techniques may lead to misleading results. That is, regression may
lead to nonsense correlations (seemingly significant correlations) or to spurious
regressions (seemingly significant effects) that are due only to the presence
of  neglected  trends  in  the  variables.  We  will  illustrate  this  by  means  of  a 
historical  example  and  by  means  of  a  simple  simulation  experiment. 
Example 7.23: Mortality and Marriages

First  we  consider  data  from  a  historically  influential  paper  on  spurious 
regressions.  We  will  discuss  (i)  the  data  and  (ii)  the  results  of  regressions 
with  these  trending  data. 

(i)  The  data 

Exhibit  7.27  (a)  shows  yearly  data  on  the  standardized  mortality  (per  1000 
persons)  in  England  and  Wales  and  on  the  proportion  of  Church  of  England 


(a) [Figure: time plots of STMORT and CEMARR against the year]

Panel 2: Dependent Variable: STMORT; Sample: 1866 1911; Method: OLS

Variable      Coefficient   Std. Error   t-Statistic   Prob.
C             -13.88367     1.573472     -8.823593     0.0000
CEMARR          0.046137    0.002249     20.51460      0.0000

R-squared       0.905346

Panel 3: Dependent Variable: STMORT; Sample: 1866 1911; Method: OLS

Variable        Coefficient   Std. Error   t-Statistic   Prob.
C               -20.70949     9.400182     -2.203095     0.0330
@TREND(1866)      0.030870    0.041907      0.736640     0.4653
CEMARR            0.054920    0.012135      4.525843     0.0000

R-squared         0.906525

Panel 4: Dependent Var: D(STMORT); Sample(adj): 1867 1911; Method: OLS

Variable      Coefficient   Std. Error   t-Statistic   Prob.
C             -0.132989     0.210475     -0.631852     0.5308
D(CEMARR)      0.011539     0.042687      0.270319     0.7882

R-squared      0.001696

Exhibit  7.27  Mortality  and  Marriages  (Example  7.23) 

Time  plots  of  the  standardized  mortality  per  1000  persons  in  England  and  Wales  (STMORT) 
and  of  the  proportion  of  Church  of  England  marriages  per  1000  of  all  marriages  (CEMARR) 
((a),  left  axis  for  STMORT,  right  axis  for  CEMARR,  horizontal  axis  for  years  1866-1911), 
regression  without  trend  (Panel  2)  and  with  deterministic  trend  (Panel  3)  and  regression  of 
the  variables  in  first  differences  (Panel  4). 
marriages (per 1000 of all marriages) for the period 1866-1911. The data in
the file are reconstructed from figure 1 in G. U. Yule, ‘Why do we Sometimes
Get Nonsense-Correlations between Time-Series’, Journal of the Royal Statistical
Society, 89 (1926), 1-64, at p. 3. The sample correlation between
the two original variables reported in the paper is 0.9512, and for the
reconstructed data this correlation is 0.9515, which indicates that the
reconstructed data are quite close to the original ones.

(ii)  Results  of  regressions 

We  perform  a  regression  of  mortality  on  the  proportion  of  Church  of 
England marriages (either with a constant included or with a constant and a
deterministic trend included). The results are in Panels 2 and 3 of Exhibit 7.27.
In  both  cases  the  effect  of  marriages  on  mortality  is  highly  significant 
(the  t-value  is  20.5  in  the  model  without  deterministic  trend  and  it  is  4.5  in 
the  model  with  deterministic  trend).  It  seems  quite  unlikely  that  the  way 
people  marry  has  anything  to  do  with  the  mortality  in  the  same  year.  The 
positive  association  of  both  variables  is  due  to  their  common  decline  over 
the  sample  period.  This  becomes  clear  if  we  regress  the  variables  after  taking 
first  differences,  to  remove  the  stochastic  trends  in  both  variables.  The  results 
in  Panel  4  of  Exhibit  7.27  show  that  the  effect  of  changes  in  marriages  on 
changes  in  mortality  is  not  at  all  significant  (the  t-value  is  0.27  with  P  =  0.79). 
Results like these have inspired the practice of taking first differences
of trended variables in order to prevent nonsense correlations.

Example  7.24:  Simulated  Random  Walk  Data 

Next we illustrate the possibility of spurious regressions by means of a
simulation. We generate two independent random walks, $y_t = y_{t-1} + \eta_t$
and $x_t = x_{t-1} + \omega_t$, where $\eta_t$ and $\omega_t$ are two independent white noise
processes. So the two variables $y_t$ and $x_t$ are completely unrelated.
Exhibit 7.28 (a-d) show scatter diagrams of $y_t$ against $x_t$ and Panels 5-8
show the outcomes of regressions of $y_t$ on $x_t$ for different sample sizes
(n = 10, n = 100, n = 1000, and n = 10,000). For larger sample sizes the effect
of $x_t$ on $y_t$ becomes more and more significant if measured by the t-value of
the slope coefficient. The $R^2$ is also quite large and suggests a significant
relationship between $x_t$ and $y_t$. Note, however, that the estimated slope is
negative for sample size n = 1000 but positive for the other three sample
sizes. That something is wrong with these regressions is indicated by the
Durbin-Watson statistic, which comes very close to zero in large samples. The
estimated effects are, of course, spurious for these data. To prevent this
kind of nonsense regression one can take first differences of variables that
contain stochastic trends. In our simulation this gives the two white noise
series $\eta_t$ and $\omega_t$, and the corresponding regressions are not at all
significant anymore.
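
The experiment is easy to repeat. The sketch below is not the simulation behind Exhibit 7.28: the seed, the sample size, and the helper name `ols_slope_stats` are assumptions chosen for illustration. It generates two independent random walks, regresses one on the other in levels and in first differences, and reports the slope, its t-value, the R-squared, and the Durbin-Watson statistic.

```python
import numpy as np

def ols_slope_stats(y, x):
    """OLS of y on a constant and x: slope, its t-value, R-squared, Durbin-Watson."""
    n = len(y)
    X = np.column_stack([np.ones(n), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    s2 = resid @ resid / (n - 2)                 # residual variance estimate
    cov = s2 * np.linalg.inv(X.T @ X)            # conventional OLS covariance matrix
    t_slope = coef[1] / np.sqrt(cov[1, 1])
    r2 = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)
    dw = np.sum(np.diff(resid) ** 2) / (resid @ resid)
    return coef[1], t_slope, r2, dw

rng = np.random.default_rng(0)                   # arbitrary seed, for reproducibility
n = 1000
eta, omega = rng.standard_normal(n), rng.standard_normal(n)
y, x = np.cumsum(eta), np.cumsum(omega)          # two independent random walks

levels = ols_slope_stats(y, x)                   # regression in levels
diffs = ols_slope_stats(np.diff(y), np.diff(x))  # regression in first differences
```

Comparing `levels` with `diffs` shows the pattern described above: a Durbin-Watson statistic near zero in levels, and a Durbin-Watson statistic near two with an unremarkable t-value after differencing.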


Exhibit 7.28 Simulated Random Walk Data (Example 7.24)

Scatter diagrams of two independent random walks for four sample sizes (n = 10 in (a),
n = 100 in (b), n = 1000 in (c), and n = 10,000 in (d)).


Statistical  causes  of  spurious  regressions 

To  obtain  some  understanding  of  the  statistical  reasons  for  possible  spurious 
results  with  trending  variables  we  consider  the  above  simulation  in  Example 
7.24 in more detail. The data are generated by $y_t = y_{t-1} + \eta_t$ and
$x_t = x_{t-1} + \omega_t$, and the regression model is given by
$y_t = \alpha + \beta x_t + \varepsilon_t$. As the two processes are independent, the data
generating process corresponds to $\alpha = 0$ and $\beta = 0$. So the error terms of
the model are equal to $\varepsilon_t = y_t = y_1 + \sum_{k=2}^{t} \eta_k$. This implies that the
error terms are very strongly correlated. For instance, if $y_1 = 0$ then
$\operatorname{var}(\varepsilon_t) = (t-1)\sigma^2$ and $\operatorname{cov}(\varepsilon_t, \varepsilon_s) = (s-1)\sigma^2$ for $t > s \geq 2$.
In the simulation this strong positive serial correlation was indicated by the
Durbin-Watson statistic, which is close to zero. There are two reasons why the
conventional t-test on significance is misleading in this case. The first is
that, in the case of serial correlation, the t-value of the estimated slope
does not follow the t-distribution anymore, as was discussed in Section 5.5.2
(p. 359). The second reason
Panel 5: Dependent Variable: Y; Sample: 1 10; Included obs. 10

Variable   Coefficient   Std. Error   t-Statistic   Prob.
C          15.90026      1.241768      12.80454     0.0000
X           0.434657     0.182335       2.383839    0.0443

R-squared   0.415320     Durbin-Watson stat   2.217163

Panel 6: Dependent Variable: Y; Sample: 1 100; Included obs. 100

Variable   Coefficient   Std. Error   t-Statistic   Prob.
C          16.47030      0.694868      23.70278     0.0000
X           0.204200     0.050861       4.014873    0.0001

R-squared   0.141249     Durbin-Watson stat   0.299211

Panel 7: Dependent Variable: Y; Sample: 1 1000; Included obs. 1000

Variable   Coefficient   Std. Error   t-Statistic   Prob.
C          56.80644      0.428361     132.6135      0.0000
X          -0.689132     0.029447     -23.40248     0.0000

R-squared   0.354328     Durbin-Watson stat   0.009218

Panel 8: Dependent Variable: Y; Sample: 1 10000; Included obs. 10000

Variable   Coefficient   Std. Error   t-Statistic   Prob.
C          51.78272      0.455985     113.5623      0.0000
X           1.041574     0.007875     132.2616      0.0000

R-squared   0.636319     Durbin-Watson stat   0.002408

Exhibit 7.28 (Contd.)

Regressions corresponding to the scatter diagrams in (a-d).


is that the regressor $x_t$ is non-stationary. In Section 4.1.4 (p. 197) we
concluded that, under the usual assumptions, the t-test is still valid
asymptotically for stochastic regressors that are stable. However, in the
simulation the variable $x_t$ is a random walk that does not satisfy the
stability condition because $\operatorname{plim}\bigl(\frac{1}{n}\sum_{t=1}^{n} x_t^2\bigr) = \infty$. This also
affects the distribution of the t-value, as was discussed in Section 7.3.3.
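
As a worked step, the variance and covariance of the error terms stated earlier in this paragraph (with $y_1 = 0$ and $\sigma^2$ the variance of the white noise $\eta_t$) imply the following correlation between two error terms:

```latex
\[
\operatorname{corr}(\varepsilon_t, \varepsilon_s)
  = \frac{\operatorname{cov}(\varepsilon_t, \varepsilon_s)}
         {\sqrt{\operatorname{var}(\varepsilon_t)\,\operatorname{var}(\varepsilon_s)}}
  = \frac{(s-1)\sigma^2}{\sqrt{(t-1)(s-1)}\,\sigma^2}
  = \sqrt{\frac{s-1}{t-1}}, \qquad t > s \geq 2 .
\]
```

For instance, for $s = 100$ and $t = 101$ the correlation is $\sqrt{99/100} \approx 0.995$, so that neighbouring error terms are almost perfectly correlated in large samples.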

Differencing  to  remove  stochastic  trends 

The foregoing results motivate the practice of taking first differences of
variables with trends, in order to prevent spurious results. This is to protect
ourselves from claiming the existence of significant relations that are caused
only by neglected trends in the observed variables. However, by taking first
differences the interpretation of the model changes. The model is then
concerned with the short-run relationship between the variables, as their
long-run dependence is eliminated in this way. To explain this in more detail,
suppose that the two variables $y_t$ and $x_t$ are both integrated of order 1; that
is, they contain stochastic trends and the series $\Delta y_t$ and $\Delta x_t$ are stationary.
Suppose that we initially specify an ADL model to explain $y_t$ in terms of $x_t$.
This model can be written in error correction form (7.34). If $\phi(1) \neq 0$, then
this model contains the error correction term $(y_{t-1} - \lambda x_{t-1})$, which has an
interesting economic interpretation in terms of the long-run equilibrium
between the levels of the two variables. If we were to follow the practice
of removing the trends and making an ADL model for $\Delta y_t$ in terms of $\Delta x_t$,
then the error correction term would drop out of the model. This is a
correct procedure if $\phi(1) = 0$, but we would omit an important regressor
if $\phi(1) \neq 0$.
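
To make the omitted regressor concrete, consider the simplest case of an ADL(1,1) model. Equation (7.34) is not reproduced in this section, so the notation below is an assumption; it is chosen to match the estimated equation shown in Panel 5 of Exhibit 7.29, with $\phi(1) = 1 - \phi$.

```latex
\begin{align*}
y_t &= \alpha + \phi y_{t-1} + \beta_0 x_t + \beta_1 x_{t-1} + \varepsilon_t, \\
\Delta y_t &= \beta_0 \Delta x_t
  - (1-\phi)\bigl(y_{t-1} - \lambda x_{t-1} - \delta\bigr) + \varepsilon_t,
\qquad \lambda = \frac{\beta_0 + \beta_1}{1-\phi}, \quad
\delta = \frac{\alpha}{1-\phi}.
\end{align*}
```

Expanding the second line and collecting terms gives back the first line, so the two forms are equivalent when $\phi \neq 1$. A regression of $\Delta y_t$ on $\Delta x_t$ alone leaves out the term in brackets.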

Cointegrated  time  series 

If the two variables $y_t$ and $x_t$ are both integrated of order 1 (that is, if they
both contain stochastic trends), then the error correction model (7.34) has an
interesting interpretation for $\phi(1) \neq 0$. In this case the term
$(y_{t-1} - \lambda x_{t-1} - \delta)$ in (7.34) can be written as a linear combination of the
stationary variables $\varepsilon_t$, $\Delta y_t$, and $\Delta x_t$ and their lags. This implies that
$(y_{t-1} - \lambda x_{t-1})$ is also stationary. That is, in this case the variables $y_t$
and $x_t$ are integrated of order 1, but the linear combination $(y_t - \lambda x_t)$ is
stationary. The series $y_t$ and $x_t$ are then said to be cointegrated. Stated
intuitively, if two series are cointegrated, then they share one common trend
that drops out in the linear combination $(y_t - \lambda x_t)$. A regression of $y_t$ on
$x_t$ is then not spurious, as the relation is
caused  by  one  trend  term  that  is  common  to  the  two  variables.  The  regression 
is  even  of  the  greatest  interest,  as  it  provides  the  long-run  equilibrium  relation 
between  the  two  series.  So  we  should  not  take  first  differences  if  the  variables 
are  cointegrated. 

Summarizing,  if  the  observed  series  contain  stochastic  trends,  then  we 
should  proceed  as  follows.  If  the  series  are  not  cointegrated,  then  we  should 
take  first  differences  to  prevent  spurious  results.  But,  if  the  variables  are 
cointegrated,  we  should  estimate  an  error  correction  model  to  incorporate 
the  long-run  relations  between  the  variables.  Tests  for  cointegration  and  the 
modelling  of  cointegrated  time  series  are  further  discussed  in  Section  7.6.3. 
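
A small simulation can illustrate the difference between the cointegrated case and the spurious case. The sketch below is an informal illustration and not one of the cointegration tests of Section 7.6.3. The long-run multiplier 1.5, the AR(1) coefficient 0.5 of the equilibrium error, the seed, and the helper `ar1_coef` are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)           # arbitrary seed
n, lam = 2000, 1.5                       # hypothetical sample size and long-run multiplier

x = np.cumsum(rng.standard_normal(n))    # I(1) regressor: a random walk
u = np.zeros(n)                          # stationary AR(1) deviation from equilibrium
shocks = rng.standard_normal(n)
for t in range(1, n):
    u[t] = 0.5 * u[t - 1] + shocks[t]
y = lam * x + u                          # y shares the stochastic trend of x

def ar1_coef(z):
    """Slope of the regression of z_t on z_{t-1}, after demeaning z."""
    z = z - z.mean()
    return (z[1:] @ z[:-1]) / (z[:-1] @ z[:-1])

X = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coef

slope_hat = coef[1]                      # estimate of the long-run multiplier
rho_resid = ar1_coef(resid)              # persistence of the equilibrium error
rho_y = ar1_coef(y)                      # persistence of the level series
```

In this setting the levels regression recovers the long-run multiplier closely, and the residuals are far less persistent than the level series, in contrast with the near-zero Durbin-Watson statistics of the independent random walks in Example 7.24.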

Example  7.25:  Interest  and  Bond  Rates  (continued) 

We  continue  our  analysis  of  the  interest  and  bond  rate  data  of  Example  7.22 
in the previous section. We will discuss (i) the data (levels and first
differences), (ii) an error correction model for these data, and (iii) the
interpretation of this model.

(i)  The  data 

Exhibit  7.29  shows  the  monthly  data  on  the  AAA  bond  rate  and  the  three- 
month  Treasury  Bill  rate,  both  in  levels  (in  (a)  and  (c))  and  in  first  differences 
(in  (b)  and  (d)).  Both  variables  show  prolonged  upward  and  downward 
Panel 5: Dependent Variable: DAAA
Method: Least Squares
Sample: 1950:01 1999:12; Included observations: 600
Convergence achieved after 4 iterations
DAAA = C(1)*DUS3MT + C(2)*(AAA(-1) - C(3)*US3MTBIL(-1) - C(4))

Parameter               Coefficient   Std. Error   t-Statistic   Prob.
C(1)  ($\beta_0$)        0.279889     0.014308     19.56179      0.0000
C(2)  ($-\phi(1)$)      -0.029131     0.005153     -5.653540     0.0000
C(3)  ($\lambda$)        1.097333     0.086078     12.74820      0.0000
C(4)  ($\delta$)         1.611726     0.495352      3.253699     0.0012

R-squared                0.406423

(f) Panel 6: Wald Test: Null Hypothesis: C(3)=1
F-statistic   1.278621   Probability 0.258610
Chi-square    1.278621   Probability 0.258156

Exhibit  7.29  Interest  and  Bond  Rates  (Example  7.25) 

Monthly time series of AAA bond rate and US three-month Treasury Bill rate (a) and of the
two series of first differences (b), two corresponding scatter plots ((c) and (d)), ADL(1,1) model
in error correction form (Panel 5), and test on unit long-run multiplier (Panel 6).
movements and are non-stationary. However, the two series tend to stay close
to each other in the long run, so that they possibly share a common trend.
The series of first differences do not contain trends anymore. Clearly,
modelling the (short-run) relation between the monthly changes in these rates is
something different from the possible (long-run) relation between the levels
of these two variables.

(ii)  Error  correction  model 

We estimate an ADL(1,1) model for these data, in the error correction form
(7.34). Because in the ADL(1,1) model p = r = 1, the two summations in (7.34)
are dropped in this model. The results are shown in Panel 5 of Exhibit 7.29.
The long-run multiplier $\lambda$ in (7.34) is estimated as $\hat{\lambda} = 1.097$. The F-test
on the restriction of a unit long-run multiplier ($\lambda = 1$) in Panel 6 has
P = 0.26, so that this hypothesis is not rejected. The coefficient of the
error correction term in (7.34) is estimated as $-\hat{\phi}(1) = -0.029$ (see
Panel 5). This differs significantly from zero, and the negative sign means
that deviations from equilibrium are corrected.
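
The numbers in Panels 5 and 6 of Exhibit 7.29 can be checked with a few lines of arithmetic. The sketch below uses only the reported estimates. The half-life is an added illustration that does not appear in the exhibit, and it assumes that no new shocks occur and that the Treasury Bill rate does not change, so that a deviation from equilibrium shrinks by the same factor each month.

```python
import math

# Estimates and standard error copied from Panel 5 of Exhibit 7.29.
lam_hat, lam_se = 1.097333, 0.086078   # long-run multiplier C(3) and its standard error
adj_coef = -0.029131                   # error correction coefficient C(2)

# Wald test of lambda = 1 with a single restriction: F equals the squared t-ratio.
t_ratio = (lam_hat - 1.0) / lam_se
f_stat = t_ratio ** 2

# Each month a fraction -adj_coef of last month's deviation is removed, so
# (under the stated assumptions) a deviation is multiplied by 1 + adj_coef per month.
decay = 1.0 + adj_coef
half_life_months = math.log(0.5) / math.log(decay)
```

The squared t-ratio agrees with the F-statistic of about 1.28 reported in Panel 6, up to rounding of the inputs. The half-life of roughly two years shows that the correction, although statistically significant, is slow.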

(iii)  Interpretation  of  the  model 

The above results suggest that the two series may be cointegrated and (as
$\lambda = 1$ is not rejected) that a shift in the level of the three-month Treasury
Bill rate leads, in the long run, to an equally large shift in the level of the
AAA bond rate.
However,  these  are  only  tentative  conclusions,  because  the  results  could  be 
spurious  if  the  variables  are  not  cointegrated.  In  the  next  section  we  will 
perform  a  statistical  test  for  the  presence  of  cointegration  for  these  two  series 
(see  Example  7.27),  and  we  will  reconsider  the  nature  of  the  equilibrium 
mechanism. 


7.5.4  Summary 

In  this  section  we  have  considered  the  modelling  of  one  time  series  variable 
in  terms  of  another  time  series  variable.  This  involves  a  combination 
of regression models (now with lagged dependent and lagged explanatory
variables added) and univariate time series models (now with explanatory
variables added).

•  The  autoregressive  model  with  distributed  lags  can  be  estimated  and 
evaluated  in  the  usual  way,  provided  that  the  time  series  are  stationary 
and  that  the  explanatory  variables  are  exogenous. 

•  This  model  can  be  written  in  error  correction  form,  with  interesting 
interpretations  in  terms  of  long-run  equilibria  between  variables  and 
adjustments  in  case  the  variables  are  out  of  equilibrium. 
•  If  the  time  series  contain  stochastic  trends,  then  conventional  methods 
may  lead  to  spurious  results.  This  can  be  prevented  by  differencing  the 
variables  until  they  are  stationary.  This  should  be  done  only  if  the 
variables  are  not  cointegrated,  as  otherwise  the  model  misses  long-run 
equilibrium  effects. 
7.6 Vector autoregressive models

Uses Chapters 1-4; Sections 5.5, 5.7; Sections 7.1-7.3, 7.5; Appendix A.5.


7.6.1  Stationary  vector  autoregressions 

Joint  model  for  multiple  observed  variables 

In  the  foregoing  section  we  discussed  regression  models  with  lags  to  explain 
the  dependent  variable  yt  in  terms  of  an  explanatory  variable  xt.  A  crucial 
assumption  in  such  models  is  that  the  variable  xt  is  exogenous  —  stated 
intuitively,  that  it  does  not  depend  on  yt.  Otherwise  the  parameters  are  not 
estimated  consistently  and  standard  procedures  for  diagnostic  testing  and 
forecasting  are  not  valid  anymore.  This  was  discussed  in  Section  5.7,  to 
which  we  refer  for  further  background  on  endogenous  regressors.  If  the 
variable  xt  is  endogenous  (so  that  it  depends  on  yt),  then  we  can  try  to 
make  a  joint  model  for  the  two  variables  xt  and  yt,  so  that  we  get  two 
equations.  Such  models  are  called  multiple  equation  models.  In  Section  7.6 
we  discuss  the  extension  of  univariate  autoregressive  models  to  the  case  of 
more  than  one  endogenous  variable.  We  consider  stationary  time  series  in 
Sections  7.6.1  and  7.6.2  and  time  series  with  trends  in  Section  7.6.3.  In 
Section 7.7 we briefly discuss three other types of multiple equation regression
models, namely, seemingly unrelated regressions in Section 7.7.2,
models  for  panel  data  in  Section  7.7.3,  and  simultaneous  equation  models 
in  Section  7.7.4. 

Importance  of  correct  choice  of  endogenous  variables 

To  illustrate  the  possible  danger  of  neglecting  the  endogeneity  of  explanatory 
variables  we  consider  a  simple  example.  Suppose  that  the  variables  yt  and  xt 
are  generated  by  the  model 

$$y_t = \phi y_{t-1} + \eta_t, \qquad x_t = \gamma y_{t-1} + \omega_t.$$

We assume that $0 < \phi < 1$ and $\gamma \neq 0$ and that $\eta_t$ and $\omega_t$ are independent
white noise processes. Then $y_t$ is an AR(1) process that does not depend on
$x_t$, whereas $x_t$ depends on the past of $y_t$. In practice we do not know the DGP,
and it may be that we are interested to see whether the variable $y_t$ can be
explained in terms of $x_t$. Suppose that we (wrongly) assume that $x_t$ is
exogenous in the regression model $y_t = \beta x_t + \varepsilon_t$ and that we estimate $\beta$ by
OLS. In large enough samples the regression coefficient then tends to
$\operatorname{plim}(b) = \operatorname{cov}(x_t, y_t)/\operatorname{var}(x_t) \neq 0$, because
$\operatorname{cov}(x_t, y_t) = \operatorname{cov}(\gamma y_{t-1} + \omega_t, \phi y_{t-1} + \eta_t) = \gamma\phi\operatorname{var}(y_{t-1}) \neq 0$.
So, if the endogeneity of $x_t$ is neglected, then this regression gives the
wrong impression that the variable $x_t$ would affect the variable $y_t$. This is
caused by the fact that the regressor $y_{t-1}$ is omitted in the regression
model for $y_t$, whereas the wrong regressor $x_t$ (that is correlated with
$y_{t-1}$) is included instead of $y_{t-1}$.
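
The argument can be checked by simulation. In the sketch below the parameter values (0.8 for the AR coefficient, 1.0 for the effect of lagged y on x, unit variances for both white noise processes), the seed, and the sample size are assumptions chosen for illustration. It compares the misspecified regression of y on x with a regression that also includes lagged y.

```python
import numpy as np

rng = np.random.default_rng(2)        # arbitrary seed
n, phi, gamma = 20000, 0.8, 1.0       # hypothetical sample size and parameters

eta = rng.standard_normal(n)
omega = rng.standard_normal(n)
y = np.zeros(n)
x = np.zeros(n)
for t in range(1, n):
    y[t] = phi * y[t - 1] + eta[t]        # y depends only on its own past
    x[t] = gamma * y[t - 1] + omega[t]    # x depends on the past of y

yt, xt, ylag = y[1:], x[1:], y[:-1]

# Misspecified regression of y_t on x_t alone (no constant, as in the text).
b_wrong = (xt @ yt) / (xt @ xt)

# Regression that also includes the omitted regressor y_{t-1}.
Z = np.column_stack([xt, ylag])
b_x, b_ylag = np.linalg.lstsq(Z, yt, rcond=None)[0]

# Probability limit implied by the DGP with unit noise variances:
# cov(x, y) / var(x) = gamma * phi * var(y) / (gamma**2 * var(y) + 1).
var_y = 1.0 / (1.0 - phi ** 2)
plim_b = gamma * phi * var_y / (gamma ** 2 * var_y + 1.0)
```

With these values `b_wrong` settles near `plim_b`, which is clearly non-zero, whereas `b_x` is close to zero and `b_ylag` is close to the AR coefficient once lagged y is included.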


Vector  autoregressive  model  of  order  1 

Of  course,  in  practice  we  do  not  know  the  data  generating  process.  If  the 
explanatory  variable  is  possibly  endogenous,  it  is  better  to  start  with  a  model 
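that contains equations for both variables. In the above example the equations
of the DGP can be written in matrix form as

$$\begin{pmatrix} x_t \\ y_t \end{pmatrix} = \begin{pmatrix} 0 & \gamma \\ 0 & \phi \end{pmatrix} \begin{pmatrix} x_{t-1} \\ y_{t-1} \end{pmatrix} + \begin{pmatrix} \omega_t \\ \eta_t \end{pmatrix}.$$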


In  practice  we  do  not  know  the  parameter  restrictions  in  this  model,  but  we 
can  estimate  the  parameters  in  the  unrestricted  model 


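$$\begin{pmatrix} x_t \\ y_t \end{pmatrix} = \begin{pmatrix} \alpha_1 \\ \alpha_2 \end{pmatrix} + \begin{pmatrix} \phi_{11} & \phi_{12} \\ \phi_{21} & \phi_{22} \end{pmatrix} \begin{pmatrix} x_{t-1} \\ y_{t-1} \end{pmatrix} + \begin{pmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \end{pmatrix}, \qquad \begin{pmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \end{pmatrix} \sim \text{IID}\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{pmatrix} \right).$$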


We  use  the  following  notation.  Usually  capital  letters  denote  matrices,  but  in 
multiple  equation  models  we  will  denote  vectors  of  variables  also  by  capital 
letters (for instance $Y_t$), to distinguish these from single variables like $y_t$.
Let $Y_t$ denote the 2 x 1 vector $(x_t, y_t)'$, $\alpha$ the 2 x 1 vector $(\alpha_1, \alpha_2)'$,
$\varepsilon_t$ the 2 x 1 vector $(\varepsilon_{1t}, \varepsilon_{2t})'$, $\Phi$ the 2 x 2 matrix of AR coefficients,
and $\Omega$ the 2 x 2 covariance matrix of the disturbance terms. Then the above
model can be written as

$$Y_t = \alpha + \Phi Y_{t-1} + \varepsilon_t, \qquad \varepsilon_t \sim \text{IID}(0, \Omega). \qquad (7.35)$$

This is called a vector autoregressive (VAR) model of order 1, because it is a
direct generalization of the univariate AR(1) model to the case of a vector of
variables. The VAR(1) model for m variables is defined in a similar way, in
which case $Y_t$ is the m x 1 vector of variables, $\alpha$ is an m x 1 vector of
constants, and $\Phi$ and $\Omega$ are m x m matrices.
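
As an illustration of the unrestricted model, the sketch below simulates a two-variable VAR(1) process and estimates its parameters by OLS applied to each equation, which is one common way to estimate such a model. The function names, the parameter values (chosen to mimic the DGP of the example above), and the normal distribution of the disturbances are assumptions made for the example.

```python
import numpy as np

def simulate_var1(alpha, Phi, Omega, n, rng):
    """Simulate Y_t = alpha + Phi Y_{t-1} + eps_t with eps_t ~ N(0, Omega), Y_0 = 0."""
    m = len(alpha)
    chol = np.linalg.cholesky(Omega)
    Y = np.zeros((n, m))
    for t in range(1, n):
        Y[t] = alpha + Phi @ Y[t - 1] + chol @ rng.standard_normal(m)
    return Y

def estimate_var1(Y):
    """Equation-by-equation OLS of Y_t on a constant and Y_{t-1}."""
    X = np.column_stack([np.ones(len(Y) - 1), Y[:-1]])
    B, *_ = np.linalg.lstsq(X, Y[1:], rcond=None)     # B has shape (m + 1, m)
    resid = Y[1:] - X @ B
    alpha_hat = B[0]
    Phi_hat = B[1:].T                                  # row i holds the lags in equation i
    Omega_hat = resid.T @ resid / (len(resid) - X.shape[1])
    return alpha_hat, Phi_hat, Omega_hat

# Hypothetical parameters: x_t = gamma * y_{t-1} + omega_t and
# y_t = phi * y_{t-1} + eta_t, stacked as Y_t = (x_t, y_t)'.
gamma, phi = 1.0, 0.8
alpha_true = np.zeros(2)
Phi_true = np.array([[0.0, gamma], [0.0, phi]])
Omega_true = np.eye(2)

rng = np.random.default_rng(3)                         # arbitrary seed
Y = simulate_var1(alpha_true, Phi_true, Omega_true, 20000, rng)
alpha_hat, Phi_hat, Omega_hat = estimate_var1(Y)
```

The estimated first column of `Phi_hat` is close to zero, so the unrestricted model reveals that lagged x plays no role in either equation, without the zero restrictions having been imposed in advance.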
Stationary VAR(1) process

A VAR process $Y_t$ is called stationary if it has a constant vector of means $E[Y_t]$ and
a finite and constant covariance matrix $\operatorname{var}(Y_t) = E[(Y_t - E[Y_t])(Y_t - E[Y_t])']$,
and if the autocovariance matrices $\operatorname{cov}(Y_t, Y_{t-k})$ depend only on the lag k and
not on the time t. For the case of a single variable (m = 1), it was shown in Section
7.1.3 that the AR(1) process is stationary if and only if $-1 < \phi < 1$. In the
multivariate model with m > 1 variables the stationarity condition is that $\Phi$ has
all its eigenvalues within the unit circle. To clarify this result, we rewrite the
VAR(1) model by repetitive substitution of the equation (7.35) as

$$Y_t = \Phi^{t-1} Y_1 + \sum_{j=0}^{t-2} \Phi^j \alpha + \sum_{j=0}^{t-2} \Phi^j \varepsilon_{t-j}.$$

The effects of the starting values $Y_1$ and of the disturbances die out over time if
and only if $\Phi^j \to 0$ for $j \to \infty$. This is equivalent to the condition that $\Phi$ has all its
eigenvalues within the unit circle. In this case the mean and variance of the process
$Y_t$ are obtained from the above equation (for $t \to \infty$), so that

$$E[Y_t] = \sum_{j=0}^{\infty} \Phi^j \alpha = (I - \Phi)^{-1}\alpha, \qquad \operatorname{var}(Y_t) = \sum_{j=0}^{\infty} \Phi^j \Omega (\Phi')^j.$$

The autocovariance matrix at lag k > 0 is equal to
$\operatorname{cov}(Y_t, Y_{t-k}) = \sum_{j=0}^{\infty} \Phi^{k+j} \Omega (\Phi')^j$. The stationarity condition can also be
expressed in terms of the polynomial matrix $\Phi(z) = I - \Phi z$ for the VAR(1) model
(7.35), that is, $\Phi(L) Y_t = \alpha + \varepsilon_t$ where $L Y_t = Y_{t-1}$. The stationarity condition is
that all the m eigenvalues of the matrix $\Phi$ lie inside the unit circle; that is, all
solutions of the equation $\det(\Phi - \lambda I) = 0$ should satisfy $|\lambda| < 1$. The roots of the
polynomial matrix $\Phi(z)$ are the solutions of the equation
$\det(\Phi(z)) = \det(I - \Phi z) = (-z)^m \det(\Phi - z^{-1} I) = 0$. So the roots of $\Phi(z)$ are the
inverses of the eigenvalues $\lambda$, and the stationarity condition is that all roots of
the VAR polynomial $\Phi(z)$ lie outside the unit circle. This generalizes the
stationarity condition of Section 7.1.3 for univariate AR(p) processes.
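
The eigenvalue condition and the moment formulas can be evaluated numerically. In the sketch below the parameter values and the function name `var1_moments` are assumptions for illustration. Instead of summing the infinite series, the covariance matrix is obtained from the equivalent fixed-point relation var(Y) = Phi var(Y) Phi' + Omega, solved with a Kronecker product, and the result is cross-checked against a truncated version of the series.

```python
import numpy as np

def var1_moments(alpha, Phi, Omega):
    """Eigenvalue check, mean, and covariance matrix of a VAR(1) process."""
    m = Phi.shape[0]
    eigvals = np.linalg.eigvals(Phi)
    stationary = bool(np.all(np.abs(eigvals) < 1.0))
    if not stationary:
        return stationary, eigvals, None, None
    mean = np.linalg.solve(np.eye(m) - Phi, alpha)      # (I - Phi)^{-1} alpha
    # var(Y) = Phi var(Y) Phi' + Omega, so
    # vec(var(Y)) = (I - kron(Phi, Phi))^{-1} vec(Omega).
    vec_var = np.linalg.solve(np.eye(m * m) - np.kron(Phi, Phi), Omega.reshape(-1))
    return stationary, eigvals, mean, vec_var.reshape(m, m)

# Hypothetical parameter values for illustration.
alpha = np.array([1.0, 0.5])
Phi = np.array([[0.5, 0.2], [0.1, 0.6]])
Omega = np.array([[1.0, 0.3], [0.3, 2.0]])

stationary, eigvals, mean, variance = var1_moments(alpha, Phi, Omega)

# Cross-check against the truncated infinite sum of Phi^j Omega (Phi')^j.
series_sum = sum(
    np.linalg.matrix_power(Phi, j) @ Omega @ np.linalg.matrix_power(Phi.T, j)
    for j in range(200)
)
```

For a matrix with an eigenvalue on or outside the unit circle, such as the identity matrix, the function reports that the process is not stationary and returns no moments.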


Derivation  of  implied  univariate  ARMA  processes 

If $Y_t$ follows a stationary VAR(1) process, then each of the individual components
of $Y_t$ follows a univariate ARMA(m, m - 1) process. To show this, we use
some results of Appendix A.5 on matrices. For simplicity we assume that $\alpha = 0$
and we write the VAR(1) model as $(I - \Phi L) Y_t = \varepsilon_t$. Let $C(z)$ be the m x m
matrix of cofactors of the matrix $(I - \Phi z)$. The elements of the matrix $C(z)$
consist of determinants of (m - 1) x (m - 1) submatrices of $(I - \Phi z)$, so that
they are polynomials in z of degree (at most) (m - 1). The matrix $C(z)$ further
has the property that $C(z)(I - \Phi z) = \det(I - \Phi z) I = d(z) I$, where the (scalar)
polynomial $d(z)$ is the determinant of $(I - \Phi z)$, which has degree (at most) m. If
we premultiply the model $(I - \Phi L) Y_t = \varepsilon_t$ by $C(L)$, then using
$C(z)(I - \Phi z) = d(z) I$ we obtain

$$d(L) Y_t = C(L) \varepsilon_t.$$

Note that $d(z)$ is a scalar polynomial and $C(z)$ is a matrix polynomial. In each of
the m equations, the right-hand side is an MA process of order (at most) (m - 1)
(because all elements of $C(z)$ have degree at most (m - 1)), and the left-hand side
is an AR expression (of order at most m) in a single variable. Therefore, each of the
components of $Y_t$ follows an ARMA(m, m - 1) process (the orders may be lower
if $\Phi$ satisfies certain parameter restrictions). The results in Section 7.1.3 show that
the univariate processes are stationary if and only if the (scalar) AR polynomial
$d(z) = \det(I - \Phi z)$ has all its roots outside the unit circle. As we concluded before,
this is equivalent to the condition that $\Phi$ has all its eigenvalues within the unit
circle.


Illustration  for  VAR(1)  process  with  two  variables 

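As an illustration of the above technical results, we consider the case of $m = 2$
variables $Y_t = (x_t, y_t)'$ with disturbance terms $\varepsilon_t = (\omega_t, \eta_t)'$. In this case the
matrix of cofactors of $(I - \Phi z)$ is equal to

$$C(z) = \begin{pmatrix} 1 - \phi_{22}z & \phi_{12}z \\ \phi_{21}z & 1 - \phi_{11}z \end{pmatrix}.$$

The determinant of $(I - \Phi z)$ is equal to
$d(z) = (1 - \phi_{11}z)(1 - \phi_{22}z) - \phi_{12}\phi_{21}z^2 = 1 - (\phi_{11} + \phi_{22})z + (\phi_{11}\phi_{22} - \phi_{12}\phi_{21})z^2$.
Hence the implied univariate models, that is, the two components of
$d(L)Y_t = C(L)\varepsilon_t$, become

$$\begin{aligned}
x_t &= (\phi_{11} + \phi_{22})x_{t-1} + (\phi_{12}\phi_{21} - \phi_{11}\phi_{22})x_{t-2} + \omega_t - \phi_{22}\omega_{t-1} + \phi_{12}\eta_{t-1}, \\
y_t &= (\phi_{11} + \phi_{22})y_{t-1} + (\phi_{12}\phi_{21} - \phi_{11}\phi_{22})y_{t-2} + \phi_{21}\omega_{t-1} + \eta_t - \phi_{11}\eta_{t-1}.
\end{aligned} \qquad (7.36)$$
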

In both equations the composite error term is uncorrelated for lags 2 and larger, so
that this is an MA(1) process, and the autoregression involves two lags. So the
variables $x_t$ and $y_t$ follow ARMA(2,1) processes, and for some parameter restrictions
on $\Phi$ the orders can be lower. The processes $x_t$ and $y_t$ are stationary if and
only if the AR polynomial equation
$1 - (\phi_{11} + \phi_{22})z + (\phi_{11}\phi_{22} - \phi_{12}\phi_{21})z^2 = \det(I - \Phi z) = 0$
has both its solutions outside the unit circle. This is equivalent
to the condition that $\det(\Phi - \lambda I) = 0$ has both its solutions inside the unit circle. This
shows once more that stationarity is equivalent to the condition that the two
eigenvalues of $\Phi$ lie inside the unit circle.
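The bivariate results can be verified numerically. The sketch below, assuming NumPy and a hypothetical matrix $\Phi$, builds the cofactor matrix $C(z)$ and the scalar polynomial $d(z)$ for $m = 2$, checks the identity $C(z)(I - \Phi z) = d(z)I$ at one value of $z$, and compares the roots of $d(z)$ with the inverse eigenvalues of $\Phi$. The function names are illustrative, not taken from the text.

```python
import numpy as np

def cofactor_matrix(Phi, z):
    """C(z) for a 2x2 matrix Phi, so that C(z) (I - Phi z) = det(I - Phi z) I."""
    (p11, p12), (p21, p22) = Phi
    return np.array([[1 - p22 * z, p12 * z],
                     [p21 * z, 1 - p11 * z]])

def ar_polynomial(Phi):
    """Coefficients (d0, d1, d2) of d(z) = d0 + d1 z + d2 z^2 = det(I - Phi z)."""
    return np.array([1.0, -np.trace(Phi), np.linalg.det(Phi)])

# Hypothetical VAR(1) coefficient matrix, for illustration only.
Phi = np.array([[0.5, 0.2], [0.1, 0.3]])

d = ar_polynomial(Phi)
ar_roots = np.roots(d[::-1])       # np.roots expects the highest power first
eigenvalues = np.linalg.eigvals(Phi)

# Implied AR coefficients of x_t and y_t in (7.36): lag 1 and lag 2.
ar_lag1, ar_lag2 = -d[1], -d[2]
print(ar_lag1, ar_lag2, np.abs(ar_roots), np.abs(eigenvalues))
```

Both univariate series share the same AR coefficients, which is the cross equation restriction that univariate estimation would ignore.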
Stationary VAR(p) process

The VAR(1) model can be extended to the VAR($p$) model by incorporating
additional lags so that

$$Y_t = \alpha + \Phi_1 Y_{t-1} + \Phi_2 Y_{t-2} + \cdots + \Phi_p Y_{t-p} + \varepsilon_t, \qquad \varepsilon_t \sim \mathrm{IID}(0, \Omega). \qquad (7.37)$$

Here $\Phi_j$ ($j = 1, \cdots, p$) and $\Omega$ are $m \times m$ matrices. The VAR($p$) process $Y_t$ is
stationary if it has constant vector of means and constant autocovariances
$\mathrm{cov}(Y_t, Y_{t-k})$ that depend on the lag $k$ but not on the time $t$. Stationarity is
equivalent to the condition that the characteristic polynomial $\det(\Phi(z))$ has all its
roots outside the unit circle, where $\Phi(z)$ is the $m \times m$ polynomial matrix
$\Phi(z) = I - \Phi_1 z - \cdots - \Phi_p z^p$. Under this condition each of the individual
variables is a stationary ARMA process with AR order (at most) $mp$ and MA order
(at most) $(m-1)p$. The mean of the process $Y_t$ is equal to
$\mu = (I - \sum_{j=1}^{p} \Phi_j)^{-1}\alpha = \Phi(1)^{-1}\alpha$. Note that $\Phi(1)$
is an invertible matrix if the characteristic roots of $\Phi(z)$ all lie outside the unit circle.
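A convenient way to check the root condition for a VAR($p$) in practice is to stack the model into first-order (companion) form. This device is standard but is not taken from the text above: the roots of $\det(\Phi(z))$ lie outside the unit circle exactly when all eigenvalues of the $mp \times mp$ companion matrix lie inside it. The sketch below assumes NumPy; the coefficient values are hypothetical.

```python
import numpy as np

def companion_matrix(Phis):
    """Stack the m x m matrices Phi_1..Phi_p into the (mp x mp) companion matrix."""
    p = len(Phis)
    m = Phis[0].shape[0]
    F = np.zeros((m * p, m * p))
    F[:m, :] = np.hstack(Phis)            # first block row: Phi_1 ... Phi_p
    F[m:, :-m] = np.eye(m * (p - 1))      # identity blocks below the first block row
    return F

def varp_is_stationary(Phis):
    """True if all companion eigenvalues lie strictly inside the unit circle."""
    return bool(np.max(np.abs(np.linalg.eigvals(companion_matrix(Phis)))) < 1.0)

def varp_mean(alpha, Phis):
    """Mean mu = Phi(1)^{-1} alpha with Phi(1) = I - Phi_1 - ... - Phi_p."""
    m = Phis[0].shape[0]
    Phi_at_1 = np.eye(m) - sum(Phis)
    return np.linalg.solve(Phi_at_1, alpha)

# Hypothetical VAR(2) coefficients, for illustration only.
Phis = [0.5 * np.eye(2), 0.2 * np.eye(2)]
alpha = np.array([0.3, 0.6])
print(varp_is_stationary(Phis), varp_mean(alpha, Phis))
```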


□  Derivation  of  vector  error  correction  model 

A stationary VAR($p$) process can be written in error correction form. For the
VAR(1) model (7.35) with $\alpha = (I - \Phi)\mu$ we can write

$$\Delta Y_t = (\Phi - I)(Y_{t-1} - \mu) + \varepsilon_t.$$

This shows that deviations of $Y_{t-1}$ from the long-run mean $\mu$ are corrected by the
multiplier matrix $(\Phi - I) = -\Phi(1)$, where $\Phi(z) = (I - \Phi z)$ is the VAR polynomial
of the model. This shows some similarity with the error correction representation
(7.33) of ADL models in Section 7.5.1, with the difference that now all $m$
variables are corrected simultaneously. The VAR($p$) model can be written in a
similar form, as follows. The matrix polynomial $\Phi(z) - \Phi(1)z$ is the zero matrix
for $z = 1$, which implies that it can be factorized as $\Phi(z) - \Phi(1)z = (1 - z)\Gamma(z)$.
Here $\Gamma(z)$ is an $m \times m$ polynomial matrix of order $(p-1)$, and the value at $z = 0$ is
equal to $\Gamma(0) = \Phi(0) - \Phi(1) \cdot 0 = I$. So we can write
$\Gamma(z) = I - \sum_{j=1}^{p-1} \Gamma_j z^j$. With this factorization, and using the fact that
$\Phi(z) = \Phi(1)z + (1 - z)\Gamma(z)$, the VAR model $\Phi(L)Y_t = \alpha + \varepsilon_t$ can be written as
$\Phi(1)Y_{t-1} + \Delta Y_t - \sum_{j=1}^{p-1} \Gamma_j \Delta Y_{t-j} = \alpha + \varepsilon_t$.
Here $\alpha = \Phi(1)\mu$, where $\mu$ is the vector of means of the process $Y_t$, and by
rearranging terms we obtain

$$\Delta Y_t = -\Phi(1)(Y_{t-1} - \mu) + \sum_{j=1}^{p-1} \Gamma_j \Delta Y_{t-j} + \varepsilon_t. \qquad (7.38)$$

This shows that the deviations of $Y_{t-1}$ from the equilibrium value $\mu$ are corrected
again by the multiplier matrix $-\Phi(1)$. The model written in this form is called a
vector error correction model (VECM). It shows some similarities with the error
correction model (7.34), with the difference that in general all the $m$ variables are
affected by the correction process.
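The factorization can be made explicit. Matching powers of $z$ in $\Phi(z) - \Phi(1)z = (1 - z)\Gamma(z)$ gives $\Gamma_j = -(\Phi_{j+1} + \cdots + \Phi_p)$ for $j = 1, \cdots, p-1$; this closed form is not stated in the text and is used here as a working assumption that the code then checks numerically. The sketch assumes NumPy, and all parameter values and names are hypothetical.

```python
import numpy as np

def var_to_vecm(alpha, Phis):
    """Return Phi(1), the mean mu, and the matrices Gamma_1..Gamma_{p-1} of the VECM."""
    m = Phis[0].shape[0]
    p = len(Phis)
    Phi_at_1 = np.eye(m) - sum(Phis)
    mu = np.linalg.solve(Phi_at_1, alpha)
    # Phis[j] holds Phi_{j+1}, so sum(Phis[j:]) = Phi_{j+1} + ... + Phi_p.
    Gammas = [-sum(Phis[j:]) for j in range(1, p)]
    return Phi_at_1, mu, Gammas

# Hypothetical VAR(3) coefficients, for illustration only.
Phis = [np.array([[0.5, 0.1], [0.0, 0.3]]),
        np.array([[0.1, 0.0], [0.05, 0.1]]),
        np.array([[0.05, 0.0], [0.0, 0.05]])]
alpha = np.array([0.2, -0.1])
Phi_at_1, mu, Gammas = var_to_vecm(alpha, Phis)

# Compare both representations (without the disturbance) on arbitrary lagged values.
lags = [np.array([1.0, 2.0]), np.array([0.5, -1.0]), np.array([-0.3, 0.7])]  # Y_{t-1}, Y_{t-2}, Y_{t-3}
var_change = alpha + sum(P @ y for P, y in zip(Phis, lags)) - lags[0]
vecm_change = -Phi_at_1 @ (lags[0] - mu) + sum(
    G @ (lags[j] - lags[j + 1]) for j, G in enumerate(Gammas))
print(var_change, vecm_change)
```

The two printed vectors should coincide, since (7.38) is only a reparametrization of (7.37).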


Exercises:  T:  7.9c,  d;  S:  7.15a,  b. 
7.6.2  Estimation and diagnostic tests of stationary VAR models

Estimation  of  stationary  VAR  models 

In the foregoing section we showed that the individual variables of a VAR($p$)
process follow (univariate) ARMA models. One could estimate these implied
ARMA models by means of the techniques discussed in Section 7.2.2, but this
is not efficient. The VAR($p$) model contains $m + pm^2$ regression parameters,
whereas the $m$ univariate ARMA($mp$, $(m-1)p$) models (each with a constant
term) contain in total $m + pm^2 + pm(m-1)$ parameters. The difference
arises because the estimation of models for the individual univariate series
neglects the cross equation parameter restrictions between the different
univariate ARMA models. For instance, in the foregoing section we saw that the
AR polynomial is the same for each of the $m$ individual time series. Moreover,
this univariate approach also neglects the possible cross equation correlations
between the error terms $\varepsilon_t$ in case the covariance matrix $\Omega$ is not diagonal.
Efficient estimates are obtained by applying ML to the system of $m$ equations
(7.37). Suppose that the disturbance terms are normally distributed so that
$\varepsilon_t \sim \mathrm{NID}(0, \Omega)$, that is, $\varepsilon_t$ follows the $m$-dimensional multivariate normal
distribution. The density $p(\varepsilon_t)$ of this distribution is given in (1.21) in Section
1.2.3 (p. 31), and it follows that
$\log(p(\varepsilon_t)) = -\frac{m}{2}\log(2\pi) - \frac{1}{2}\log(\det(\Omega)) - \frac{1}{2}\varepsilon_t'\Omega^{-1}\varepsilon_t$.
Therefore the conditional log-likelihood of the VAR($p$) model
(7.37) (treating the initial values as fixed) is equal to

$$\log(L) = -\frac{m(n-p)}{2}\log(2\pi) - \frac{n-p}{2}\log(\det(\Omega)) - \frac{1}{2}\sum_{t=p+1}^{n}\Big(Y_t - \alpha - \sum_{j=1}^{p}\Phi_j Y_{t-j}\Big)'\Omega^{-1}\Big(Y_t - \alpha - \sum_{j=1}^{p}\Phi_j Y_{t-j}\Big).$$

It is left as an exercise (see Exercise 7.10) to show that ML in this model is
equivalent to applying OLS on each of the $m$ equations in (7.37) separately.
So the estimation of VAR models is quite straightforward. The covariance
matrix $\Omega$ can be estimated by $\hat{\Omega} = \frac{1}{n-p}\sum_{t=p+1}^{n} e_t e_t'$, where $e_t$ is the $m \times 1$
vector of residuals at time $t$. If some elements of the parameter matrices
are restricted in some way, then OLS is still consistent but not efficient
anymore, and in large enough samples ML provides more accurate estimates
in this case.
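Equation-by-equation OLS for an unrestricted VAR($p$) takes only a few lines. The sketch below assumes NumPy; the simulated data, the parameter values, and the function names (`simulate_var`, `fit_var_ols`) are hypothetical and serve only to illustrate the procedure. All $m$ equations share the same regressors, so a single least squares call with a matrix of dependent variables performs the $m$ separate regressions at once.

```python
import numpy as np

def simulate_var(alpha, Phis, Omega, n, rng, burn_in=200):
    """Simulate n observations of a VAR(p) with normal disturbances."""
    m, p = len(alpha), len(Phis)
    chol = np.linalg.cholesky(Omega)
    Y = np.zeros((n + burn_in + p, m))
    for t in range(p, n + burn_in + p):
        eps = chol @ rng.standard_normal(m)
        Y[t] = alpha + sum(Phis[j] @ Y[t - 1 - j] for j in range(p)) + eps
    return Y[-n:]

def fit_var_ols(Y, p):
    """OLS per equation; returns alpha_hat, [Phi_hat_j], Omega_hat and the residuals."""
    n, m = Y.shape
    # Regressors for t = p+1, ..., n: a constant and Y_{t-1}, ..., Y_{t-p}.
    X = np.hstack([np.ones((n - p, 1))] + [Y[p - j:n - j] for j in range(1, p + 1)])
    target = Y[p:]
    B, *_ = np.linalg.lstsq(X, target, rcond=None)   # column i holds equation i
    resid = target - X @ B
    alpha_hat = B[0]
    Phi_hats = [B[1 + (j - 1) * m:1 + j * m].T for j in range(1, p + 1)]
    Omega_hat = resid.T @ resid / (n - p)
    return alpha_hat, Phi_hats, Omega_hat, resid

# Hypothetical data generating process, for illustration only.
rng = np.random.default_rng(0)
alpha_true = np.array([0.1, -0.2])
Phi_true = [np.array([[0.5, 0.1], [0.2, 0.3]])]
Omega_true = np.array([[1.0, 0.5], [0.5, 2.0]])
Y = simulate_var(alpha_true, Phi_true, Omega_true, n=5000, rng=rng)
alpha_hat, Phi_hats, Omega_hat, resid = fit_var_ols(Y, p=1)
print(Phi_hats[0])
```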

Model  selection 

The OLS or ML estimators have the usual asymptotic statistical properties,
provided that the VAR process is stationary. So we can apply conventional
$t$- and $F$-tests on the significance of the coefficients. For instance, the lag order
$p$ of the VAR model can be selected by applying $F$-tests or $LR$-tests on the
significance of additional lags. These tests follow (asymptotically) the standard
$F$- and $\chi^2$-distributions respectively. Another method to select the lag
order is by minimizing the AIC or SIC, which are defined for VAR($p$)
models by

$$\mathrm{AIC}(p) = \log(\det(\hat{\Omega}_p)) + 2\,\frac{pm^2}{n}, \qquad \mathrm{SIC}(p) = \log(\det(\hat{\Omega}_p)) + \log(n)\,\frac{pm^2}{n}.$$

Here $pm^2$ is the total number of coefficients of lagged regressors in the
VAR($p$) model and $\hat{\Omega}_p$ is the estimated covariance matrix of the disturbances
in the VAR($p$) model. In practice the order $p$ of VAR models is often chosen
relatively small, as otherwise the number of parameters $pm^2$ of lagged terms
quickly becomes large.

VAR models are less useful for large numbers of variables. The total
number of parameters in the VAR($p$) model for $m$ variables is
$m + pm^2 + \frac{1}{2}m(m+1)$, namely, $m$ for the vector of constants $\alpha$, $pm^2$ for the
$m \times m$ AR matrices $\Phi_j$ ($j = 1, \cdots, p$), and $\frac{1}{2}m(m+1)$ for the symmetric
$m \times m$ matrix $\Omega$. Estimation becomes infeasible for large numbers $m$ of variables,
because the number of parameters increases with the square $m^2$. This is called the
curse of dimensionality of multiple equation models.
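The two criteria and the parameter count are simple to compute once the residual covariance matrix of each candidate model is available. The following sketch assumes NumPy and implements the formulas as stated in this section; the function names are hypothetical. Statistical packages may scale or define these criteria differently, so values need not match printed software output.

```python
import numpy as np

def var_aic(Omega_hat, p, n):
    """AIC(p) = log det(Omega_hat) + 2 p m^2 / n."""
    m = Omega_hat.shape[0]
    return np.log(np.linalg.det(Omega_hat)) + 2 * p * m ** 2 / n

def var_sic(Omega_hat, p, n):
    """SIC(p) = log det(Omega_hat) + log(n) p m^2 / n."""
    m = Omega_hat.shape[0]
    return np.log(np.linalg.det(Omega_hat)) + np.log(n) * p * m ** 2 / n

def var_parameter_count(m, p):
    """m constants, p m^2 AR coefficients, m(m+1)/2 covariance parameters."""
    return m + p * m ** 2 + m * (m + 1) // 2

# The parameter count grows quadratically in m for a fixed lag order p = 2.
for m in (2, 3, 10):
    print(m, var_parameter_count(m, p=2))
```

For a bivariate VAR(1) the count is nine, which grows to 230 for ten variables with two lags.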

Exogenous  variables  and  model  simplification 

In practice VAR models are used only for a small number of variables, with
$m = 2$ or $m = 3$ in many and $m < 10$ in almost all applications. One method
to reduce the number of parameters is by considering the possible exogeneity
of some of the variables. For instance, in our example at the beginning of
Section 7.6.1, the variable $y_t$ is exogenous in the equation for $x_t$, because in
the bivariate VAR(1) model (7.35) there holds $\phi_{21} = 0$ and $\sigma_{12} = 0$. In this
case the effect of $y_t$ on $x_t$ can be estimated simply by regressing $x_t$ on $y_t$ and
neglecting the time series model for $y_t$. This reduces the number of parameters
from in total nine for the VAR(1) model (two for $\alpha$, four for $\Phi$, and three
for $\Omega$) to four for the regression model (the constant, the two slope parameters
$\phi_{11}$ and $\phi_{12}$, and the variance of the disturbance term). This may
improve the finite sample efficiency of the estimators. More generally, consider
the VAR($p$) model where the $m$ variables are split in two groups,
denoted by $Y_t$ and $X_t$, so that

$$\begin{pmatrix} \Phi_{11}(L) & \Phi_{12}(L) \\ \Phi_{21}(L) & \Phi_{22}(L) \end{pmatrix} \begin{pmatrix} Y_t \\ X_t \end{pmatrix} = \begin{pmatrix} \alpha_1 \\ \alpha_2 \end{pmatrix} + \begin{pmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \end{pmatrix}.$$

Then the variables $X_t$ are exogenous in the equations for $Y_t$ if

$$\Phi_{21}(z) = 0 \quad \text{and} \quad \Omega_{12} = 0,$$

where $\Omega_{12}$ is the cross correlation matrix between $\varepsilon_{1t}$ and $\varepsilon_{2t}$. Indeed, under
these assumptions the process $X_t$ is described by the VAR model
$\Phi_{22}(L)X_t = \alpha_2 + \varepsilon_{2t}$, which is uncorrelated with the process $\varepsilon_{1t}$ at all lags.
Therefore the regressors $\Phi_{12}(L)X_t$ in the equations
$\Phi_{11}(L)Y_t = \alpha_1 - \Phi_{12}(L)X_t + \varepsilon_{1t}$ are uncorrelated with the disturbance terms
$\varepsilon_{1t}$, so that these regressors are exogenous in these equations. Consistent
estimates are then obtained by regressing $Y_t$ on the appropriate lags of $Y_t$ and $X_t$.
This reduces the number of involved parameters considerably. If $\Phi_{21}(z) = 0$, then
it is also said that $Y_t$ does not cause $X_t$. The corresponding $F$-test on the parameter
restrictions $\Phi_{21}(z) = 0$ is called the Granger causality test.
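For two scalar series the Granger causality test reduces to an ordinary $F$-test that compares a regression on own lags with a regression that also includes lags of the other variable. The sketch below assumes NumPy and returns only the $F$-statistic (a $P$-value would require an $F$ distribution function, which is left out here). The simulated series, in which $x$ drives $y$ but not the other way round, and the function names are hypothetical.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of the OLS regression of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta
    return float(e @ e)

def granger_f_stat(caused, causing, p):
    """F-statistic for H0: the p lags of `causing` are absent from the equation for `caused`."""
    n = len(caused)
    y = caused[p:]
    own_lags = [caused[p - j:n - j] for j in range(1, p + 1)]
    other_lags = [causing[p - j:n - j] for j in range(1, p + 1)]
    X_r = np.column_stack([np.ones(n - p)] + own_lags)      # restricted model
    X_u = np.column_stack([X_r] + other_lags)               # unrestricted model
    rss_r, rss_u = rss(X_r, y), rss(X_u, y)
    dof = (n - p) - X_u.shape[1]
    return ((rss_r - rss_u) / p) / (rss_u / dof)

# Hypothetical data: x affects y with one lag, y does not affect x.
rng = np.random.default_rng(0)
n = 2000
x = np.zeros(n)
y = np.zeros(n)
for t in range(1, n):
    x[t] = 0.5 * x[t - 1] + rng.standard_normal()
    y[t] = 0.3 * y[t - 1] + 0.5 * x[t - 1] + rng.standard_normal()

f_x_causes_y = granger_f_stat(y, x, p=2)
f_y_causes_x = granger_f_stat(x, y, p=2)
print(f_x_causes_y, f_y_causes_x)
```

The first statistic should be large and the second small, in line with how the data were generated.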

Diagnostics  for  stationary  VAR  models 

Diagnostic tests of VAR models can be performed in a similar way, as was
discussed in Section 7.2.4 for univariate AR models. Since the main purpose
of the VAR model is to express the dynamic correlations between the variables,
it is of particular importance to check whether the $m$ residual series are
white noise. A simple check is to perform tests on serial correlation for the $m$
residual series separately. Note that the model allows for cross equation
correlations between the residuals of different equations at the same time
(if the off-diagonal elements of $\Omega$ are non-zero), but the residuals should not
be correlated with each other at different time moments.

One can apply further tests on the individual equations, for instance, on
possible parameter breaks, outliers, and ARCH effects. Competing models
can be compared by their forecast performance. For instance, for a VAR(1)
model the one-step-ahead forecast is given by $\hat{Y}_{n+1} = \alpha + \Phi Y_n$ with covariance
matrix of the forecast error equal to $\Omega$. The two-step-ahead forecast is
$\hat{Y}_{n+2} = \alpha + \Phi\hat{Y}_{n+1} = (I + \Phi)\alpha + \Phi^2 Y_n$, and the corresponding covariance
matrix is $E[(Y_{n+2} - \hat{Y}_{n+2})(Y_{n+2} - \hat{Y}_{n+2})'] = \Omega + \Phi\Omega\Phi'$. Similar expressions
can be derived for higher order VAR models and longer forecast horizons.
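The forecast formulas for the VAR(1) model extend to longer horizons by a simple recursion: each step applies $\hat{Y}_{n+h} = \alpha + \Phi\hat{Y}_{n+h-1}$, and the forecast error covariance follows $\Sigma_h = \Omega + \Phi\Sigma_{h-1}\Phi'$ with $\Sigma_0 = 0$. The recursion for $h > 2$ is an extrapolation of the two cases given above, and parameters are treated as known (estimation uncertainty is ignored). The sketch assumes NumPy; the numerical values and the function name are hypothetical.

```python
import numpy as np

def var1_forecast(alpha, Phi, Omega, y_last, horizon):
    """Point forecasts and forecast error covariances for horizons 1..horizon."""
    forecasts, covariances = [], []
    y_hat = np.asarray(y_last, dtype=float)
    cov = np.zeros_like(Omega, dtype=float)
    for _ in range(horizon):
        y_hat = alpha + Phi @ y_hat             # next point forecast
        cov = Omega + Phi @ cov @ Phi.T         # next forecast error covariance
        forecasts.append(y_hat)
        covariances.append(cov)
    return forecasts, covariances

# Hypothetical parameter values, for illustration only.
alpha = np.array([0.1, 0.2])
Phi = np.array([[0.5, 0.2], [0.1, 0.3]])
Omega = np.array([[1.0, 0.3], [0.3, 2.0]])
y_n = np.array([1.0, -1.0])

forecasts, covariances = var1_forecast(alpha, Phi, Omega, y_n, horizon=3)
print(forecasts[1], covariances[1])
```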

Example  7.26:  Interest  and  Bond  Rates  (continued) 

We  continue  our  analysis  of  the  interest  and  bond  rates  data.  In  Example 
7.22  it  was  assumed  that  the  changes  in  the  Treasury  Bill  rate  are  exogenous 
in  the  model  for  the  changes  of  the  AAA  bond  rate.  Now  we  will  discuss  (i) 
the  motivation  for  a  vector  autoregressive  (VAR)  model  for  these  data,  (ii)  the 
estimation  and  selection  of  a  VAR  model,  (iii)  a  test  on  the  exogeneity  of  the 
Treasury  Bill  rate,  and  (iv)  some  diagnostic  checks  on  the  model. 




(i)  Motivation  for  a  VAR  model 

In  Example  7.22  we  estimated  an  ADL  model  for  the  (endogenous)  changes  in 
AAA  bond  rate  in  terms  of  the  (supposedly  exogenous)  changes  in  the  three- 
month  Treasury  Bill  rate.  However,  as  we  remarked  at  the  end  of  Example 
7.22,  the  changes  in  the  Treasury  Bill  rate  are  possibly  correlated  with  the 
disturbance  term  in  this  equation.  This  is  because  the  unobserved  factors  that 
influence  the  AAA  bond  rate  may  very  well  also  influence  the  Treasury  Bill 
rate.  This  was  analysed  in  Example  5.33  (p.  414-16),  where  we  concluded 
that  the  Treasury  Bill  rate  is  indeed  endogenous.  Therefore  we  now  estimate 
VAR  models  that  treat  the  changes  in  both  the  Treasury  Bill  rate  and  the  AAA 
bond  rate  as  endogenous  variables. 

(ii)  Estimation  and  selection  of  a  VAR  model 

Panel 1 of Exhibit 7.30 shows the first six autocorrelations of both variables
and also the cross correlations between the two variables. Some of the lagged
effects are significant, but it is not so easy to decide on the order of a VAR
model on the basis of these autocorrelations. We estimate VAR models for
orders $p = 1$, 2, and 3. The resulting estimates are in Panels 2-4 of Exhibit
7.30 (the $t$-values are in parentheses). The Schwarz criterion selects the
VAR(2) model. By comparing the log-likelihoods we can test the null hypothesis
of the VAR(2) model against the alternative of the VAR(3) model, which
corresponds to four restrictions. The corresponding $LR$-test has value
$LR = 2(-95.77 + 99.97) = 8.4$ with a $P$-value according to the $\chi^2(4)$
distribution of $P = 0.078$. This means that the VAR(2) model is not rejected (at
5 per cent significance level, this is sufficiently convincing for the relatively
large sample size of $n = 600$).
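The $LR$-test above can be reproduced from the two log-likelihood values reported in Exhibit 7.30. The sketch below uses only the Python standard library; it relies on the closed-form upper tail probability of the $\chi^2$ distribution with an even number of degrees of freedom, $P(\chi^2(2k) > x) = e^{-x/2}\sum_{i=0}^{k-1}(x/2)^i/i!$. The function name is hypothetical.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper tail probability of the chi-squared distribution for even df (closed form)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even number of degrees of freedom")
    half = x / 2.0
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))

loglik_var2 = -99.97261    # log-likelihood of the VAR(2) model (Panel 3)
loglik_var3 = -95.77027    # log-likelihood of the VAR(3) model (Panel 4)

lr = 2.0 * (loglik_var3 - loglik_var2)
p_value = chi2_sf_even_df(lr, df=4)    # four restrictions: m^2 = 4 coefficients at lag 3
print(round(lr, 1), round(p_value, 3))
```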

(iii)  Test  on  exogeneity  of  the  Treasury  Bill  rate 

The VAR(2) model can be used to check for the possible exogeneity of the
Treasury Bill rate in the equation for the AAA bond rate. In the VAR(2) model,
this corresponds to the following three restrictions. The two coefficients of the
AAA bond rate (with lags 1 and 2) in the equation for the Treasury Bill rate are
zero, and the disturbances of the two equations are uncorrelated. The results
in Exhibit 7.30 can be used to test for these restrictions. In Panel 3 of Exhibit
7.30, the $t$-values of the two lagged AAA bond rate terms in the equation for
the Treasury Bill rate are 6.94 and $-4.94$. The Granger causality test in Panel
6, that is, the $F$-test on the joint significance of the two coefficients, gives
$F = 28.9$ with $P = 0.0000$. So we strongly reject the null hypothesis that the
AAA bond rate does not affect the Treasury Bill rate. Further, the cross
correlation between the two series of residuals is significant with a value of
0.54 (see Panel 5). We conclude that the Treasury Bill rate is not exogenous.
This is in line with our earlier conclusion in Example 5.33 (p. 414-16).
Panel 1: autocorrelations and cross correlations
Sample: 1950:01 1999:12; Included observations: 600

Lag    DAAA      DUS3MT    DAAA with      DUS3MT with
                           past DUS3MT    past DAAA
0       1.000     1.000     0.6086         0.6086
1       0.371     0.273     0.1523         0.3377
2      -0.087    -0.106    -0.0329        -0.0964
3      -0.084    -0.089    -0.0763        -0.1248
4       0.042    -0.036     0.0577        -0.0451
5       0.151     0.044     0.1496         0.0683
6       0.010    -0.183    -0.0975        -0.1024

Panel 2: VAR(1) model
Sample: 1950:01 1999:12; Included observations: 600

Variable                        Eq. for DAAA    Eq. for DUS3MT
DAAA(-1)                         0.442465        0.605146
  t-value                       (9.27999)       (5.64462)
DUS3MT(-1)                      -0.052705        0.106805
  t-value                      (-2.45113)       (2.20907)
C                                0.005125        0.001369
  t-value                       (0.62954)       (0.07477)
R-squared                        0.146332        0.121339
S.E. equation                    0.199278        0.448077
Log Likelihood                  -133.8794
Akaike Information Criterion     0.466265
Schwarz Criterion                0.510234

Panel 3: VAR(2) model
Sample: 1950:01 1999:12; Included observations: 600

Variable                        Eq. for DAAA    Eq. for DUS3MT
DAAA(-1)                         0.529146        0.747457
  t-value                       (11.0142)       (6.93980)
DAAA(-2)                        -0.318410       -0.547129
  t-value                      (-6.44597)      (-4.94054)
DUS3MT(-1)                      -0.040867        0.165415
  t-value                      (-1.91532)       (3.45804)
DUS3MT(-2)                       0.047549       -0.051966
  t-value                       (2.24003)      (-1.09198)
C                                0.006676        0.004685
  t-value                       (0.84884)       (0.26573)
R-squared                        0.206751        0.188312
S.E. equation                    0.192419        0.431385
Log Likelihood                  -99.97261
Akaike Information Criterion     0.366575
Schwarz Criterion                0.439858

Exhibit 7.30  Interest and Bond Rates (Example 7.26)

Autocorrelations and cross correlations for the series of changes in the AAA bond rate and the
series of changes in the three-month Treasury Bill rate (Panel 1), VAR(1) model (Panel 2), and
VAR(2) model (Panel 3).
Panel 4: VAR(3) model (summary)

Log Likelihood                  -95.77027
Akaike Information Criterion     0.365901
Schwarz Criterion                0.468496

Panel 5

            Covariances residuals       Correlations residuals
            DAAA        DUS3MT          DAAA        DUS3MT
DAAA        0.036716    0.044631        1.000000    0.542195
DUS3MT      0.044631    0.184542        0.542195    1.000000

Panel 6: Pairwise Granger Causality Tests in VAR(2) model
Sample: 1950:01 1999:12

Null Hypothesis:                          Obs    F-Statistic    Probability
DUS3MT does not Granger Cause DAAA        600    3.70414        0.02519
DAAA does not Granger Cause DUS3MT        600    28.8526        1.1E-12

Panel 7: autocorrelations of residuals of VAR(2) model
Sample: 1950:01 1999:12; Included observations: 600

Lag    SACF Eq 1    Q-Stat     Prob     SACF Eq 2    Q-Stat     Prob
       (DAAA)                           (DUS3MT)
1       0.021        0.2754    0.600     0.009        0.0462    0.830
2      -0.019        0.5019    0.778    -0.019        0.2534    0.881
3       0.050        2.0435    0.563     0.012        0.3351    0.953
4       0.011        2.1115    0.715    -0.090        5.2399    0.264
5       0.131       12.579     0.028     0.063        7.6752    0.175
6      -0.006       12.600     0.050    -0.181       27.597     0.000

Exhibit 7.30  (Contd.)

Summary of outcomes of VAR(3) model (Panel 4), covariance matrix and correlation matrix of
the two residual series of the VAR(2) model (Panel 5), Granger causality tests (Panel 6), and
correlogram of the two residual series of the VAR(2) model (Panel 7).


(iv)  Some  diagnostic  checks 

We check whether the VAR(2) model captures the correlations that are
present in both series. Panel 7 of Exhibit 7.30 shows the first six autocorrelations
for the two series of residuals of the VAR(2) model. The correlations at
lags 1-4 are no longer significant, but some correlation at lags 5 and 6
remains present. Here we analysed the relation between the changes in the
two variables, that is, the short-run relations between the variables. In the
next section we will consider the long-run relationships between the levels of
these two variables (see Example 7.27).


Exercises:  T:  7.10d;  S:  7.15c,  d. 
7.6.3  Trends and cointegration

Multiple  time  series  with  stochastic  trends 

The analysis of VAR models in the foregoing section was based on the assumption
that all $m$ variables are stationary. If the variables contain deterministic
trends, then this can be modelled by incorporating appropriate time functions
as additional regressors in the VAR equations. However, if the variables
contain stochastic trends, then the standard properties of ML and related
tests are not valid anymore. This has already been discussed for univariate
time series in Section 7.3.3. Further, regressions of variables with stochastic
trends may lead to spurious results, as was discussed in Section 7.5.3. As many
economic variables contain stochastic trends, this suggests that VAR models
can be applied only after sufficient differencing of the variables to obtain
stationarity. This is indeed the way to proceed, unless the variables are
cointegrated. In this case the correlations between trended variables are not
spurious, as was explained at the end of Section 7.5.3. In this section we consider
this situation in more detail. We describe tests for cointegration and the modelling
of cointegrated variables by means of vector error correction models.

Analysis  of  the  VAR(1)  model  with  two  variables 

To introduce the main ideas we first consider the VAR(1) model (without
constant terms) for two variables. The model $Y_t = \Phi Y_{t-1} + \varepsilon_t$ can be written
in the form of the VECM

$$\Delta Y_t = \Pi Y_{t-1} + \varepsilon_t, \qquad \Pi = \Phi - I.$$

In this model there are three cases of interest, according to whether the rank
of the $2 \times 2$ matrix $\Pi$ is 0, 1, or 2. If the two variables in $Y_t$ are stationary,
this means that $\Phi$ has both its eigenvalues within the unit circle. This implies
that $\det(\Phi - I) = \det(\Pi) \neq 0$, so that the matrix $\Pi$ has rank 2. On the other
hand, if $\Pi$ has rank 0, then $\Pi = 0$ and hence $\Delta Y_t = \varepsilon_t$. Then both variables
follow random walks. In this case one says that there exist two stochastic
trends for the two variables. The variables are modelled in terms of their first
differences. A final possibility is that $\Pi$ has rank 1, so that
$0 = \det(\Pi) = \det(\Phi - I)$. In this case the matrix $\Phi$ has one eigenvalue at $z = 1$
and another eigenvalue $\rho \neq 1$. As $\Pi$ has rank 1, this means that the second
column is a multiple of the first column, so that we can write

$$\Pi = \begin{pmatrix} \alpha_1 & -\theta\alpha_1 \\ \alpha_2 & -\theta\alpha_2 \end{pmatrix} = \begin{pmatrix} \alpha_1 \\ \alpha_2 \end{pmatrix} \begin{pmatrix} 1 & -\theta \end{pmatrix} = \alpha\beta',$$

where $\alpha = (\alpha_1, \alpha_2)'$ and $\beta = (1, -\theta)'$. Let the two variables in $Y_t$ be denoted
by $Y_t = (y_t, x_t)'$; then the VECM becomes

$$\begin{aligned}
\Delta y_t &= \alpha_1(y_{t-1} - \theta x_{t-1}) + \varepsilon_{1t}, \\
\Delta x_t &= \alpha_2(y_{t-1} - \theta x_{t-1}) + \varepsilon_{2t}.
\end{aligned}$$

The rest of this section is devoted to the modelling of this kind of process.

Cointegration  in  the  VAR(1)  model  with  two  variables 

We analyse the above VECM in more detail. This model corresponds to the
case that $\Pi$ has rank 1 and that the matrix $\Phi$ in the corresponding VAR(1)
model has an eigenvalue at $z = 1$. We assume that the other eigenvalue $z = \rho$
is stable, that is, that $-1 < \rho < 1$ (if $\rho = 1$ then the process would have two
unit roots). We will show that in this case (that is, with one unit root and one
stable root) the individual variables $y_t$ and $x_t$ contain a stochastic trend, but
that $(y_{t-1} - \theta x_{t-1})$ is stationary, so that the two variables are cointegrated.
The fact that the variables $y_t$ and $x_t$ are not stationary can be derived as
follows. The implied ARMA(2,1) models for $y_t$ and $x_t$ are given in (7.36),
with AR polynomial
$d(z) = \det(I - \Phi z) = z^2\det(z^{-1}I - \Phi) = z^2(z^{-1} - 1)(z^{-1} - \rho) = (1 - z)(1 - \rho z)$.
So both time series have a unit root, that is, $y_t$ and $x_t$
contain a stochastic trend. Since we assumed that $-1 < \rho < 1$, it follows that
$y_t$ and $x_t$ are both ARIMA(1,1,1) processes. On the other hand,
$(y_{t-1} - \theta x_{t-1})$ is stationary, which can be seen as follows. Because
$\Pi$ has rank 1, it follows that $\alpha_1 \neq 0$ or $\alpha_2 \neq 0$ (or both). The above VECM
then shows that the linear combination $(y_{t-1} - \theta x_{t-1})$ is stationary,
because it can be expressed in terms of $\Delta y_t$, $\Delta x_t$, $\varepsilon_{1t}$, and $\varepsilon_{2t}$, which are all
stationary.

The above results show that the series $y_t$ and $x_t$ are cointegrated in this
case. The relation $(y_t - \theta x_t)$ is called the cointegration relation, and $y_t = \theta x_t$
is the long-run equilibrium relation between the two variables. The parameters
$\alpha_1$ and $\alpha_2$ are called the adjustment coefficients. They describe how $y_t$
and $x_t$ are adjusted if the variables are out of equilibrium. For instance, if
$\alpha_1 < 0$ and $y_{t-1} > \theta x_{t-1}$, then this leads to a downward adjustment of $y_t$ in
the direction of equilibrium.
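A small simulation makes the rank 1 case concrete. The sketch below assumes NumPy and uses hypothetical adjustment coefficients and a hypothetical cointegration parameter; the symbol `rho` for the stable eigenvalue is notation adopted here. It builds $\Pi$ as the outer product of the adjustment vector and the cointegration vector, confirms that $\Phi = I + \Pi$ has one unit eigenvalue and one stable eigenvalue equal to one plus the inner product of the two vectors, and checks that the simulated spread $y_t - \theta x_t$ behaves like a stationary AR(1) with that coefficient.

```python
import numpy as np

# Hypothetical values, for illustration only.
alpha_adj = np.array([-0.4, 0.1])      # adjustment coefficients (alpha_1, alpha_2)
theta = 2.0
beta = np.array([1.0, -theta])         # cointegration vector (1, -theta)

Pi = np.outer(alpha_adj, beta)         # rank 1 matrix alpha beta'
Phi = np.eye(2) + Pi
rho = 1.0 + beta @ alpha_adj           # the stable eigenvalue of Phi
eigenvalues = np.sort(np.linalg.eigvals(Phi).real)

# Simulate the VECM: Delta Y_t = Pi Y_{t-1} + eps_t, with Y_t = (y_t, x_t)'.
rng = np.random.default_rng(1)
n = 2000
Y = np.zeros((n, 2))
for t in range(1, n):
    Y[t] = Y[t - 1] + Pi @ Y[t - 1] + rng.standard_normal(2)

spread = Y[:, 0] - theta * Y[:, 1]     # cointegration relation y_t - theta x_t
rho_hat = (spread[1:] @ spread[:-1]) / (spread[:-1] @ spread[:-1])
print(eigenvalues, rho, rho_hat)
```

The levels wander without a fixed mean, while the spread keeps returning to zero at a speed governed by the stable eigenvalue.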


Cointegration in the VAR(p) model

Similar results hold true for VAR($p$) models for $m$ variables. Let $\Pi = -\Phi(1)$,
where $\Phi(1)$ is the $m \times m$ matrix obtained by substituting $z = 1$ in the VAR
polynomial $\Phi(z)$ of the model. We rewrite the VECM (7.38) as follows, where
we define $\gamma = \Phi(1)\mu$:

$$\Delta Y_t = \gamma + \Pi Y_{t-1} + \Gamma_1\Delta Y_{t-1} + \cdots + \Gamma_{p-1}\Delta Y_{t-p+1} + \varepsilon_t, \qquad t = p+1, \cdots, n. \qquad (7.39)$$

As before, the existence of cointegration and the number of stochastic trends
for the $m$ series in $Y_t$ depends on the rank of the matrix $\Pi$. We consider again
three cases, namely, $\mathrm{rank}(\Pi) = m$, $\mathrm{rank}(\Pi) = 0$, and $\mathrm{rank}(\Pi) = r$ with
$1 \leq r \leq m-1$. If the variables are all stationary this means that $\Phi(z)$ has all its
roots outside the unit circle. In this case the matrix $\Pi$ has full rank $m$. There are no
stochastic trends in this case. On the other hand, suppose that $\Pi = -\Phi(1) = 0$.
Then the VAR polynomial $\Phi(z)$ contains $m$ unit roots, and (7.39) is a VAR($p-1$)
model in the variables $\Delta Y_t$. In this case there are $m$ stochastic trends. Finally, if
$\mathrm{rank}(\Pi) = r$ with $1 \leq r \leq m-1$, then the polynomial $\Phi(z)$ has $m - r$ unit roots.
Assuming that the other roots of $\Phi(z)$ all lie outside the unit circle, this means
that the $m$ variables have $(m - r)$ common stochastic trends and that there are $r$
cointegration relations. This can be seen as follows. If the $m \times m$ matrix $\Pi$ has
rank $r$, it can be written as $\Pi = AB' = \sum_{j=1}^{r} \alpha_j\beta_j'$, where $A$ and $B$ are $m \times r$
matrices of rank $r$ with $m \times 1$ columns $\alpha_j$ and $\beta_j$ respectively, $j = 1, \cdots, r$. (In the
literature one often writes this matrix decomposition as $\Pi = \alpha\beta'$ instead of
$\Pi = AB'$, but here we do not follow this convention, and we write $\Pi = \alpha\beta'$ only
if $\Pi$ has rank 1, in which case the matrices $A$ and $B$ reduce to column vectors.) The
VECM implies that each of the $r$ linear combinations $\beta_j'Y_t$ is stationary, so that
there exist $r$ linearly independent cointegration relations. It should be noted that
the matrices $A$ and $B$ in the decomposition $\Pi = AB'$ are not defined uniquely,
because $\Pi = (AC^{-1})(BC')' = AB'$ for every invertible $r \times r$ matrix $C$. Therefore
the $r$ cointegration relations are also not unique, since every linear combination
$\sum_{j=1}^{r} c_j\beta_j'Y_t$ is a cointegration relation for arbitrary constants $c_j$, $j = 1, \cdots, r$.


Summary of results for VAR(p) model

Summarizing the foregoing results, suppose that all individual variables in
the $m \times 1$ vector $Y_t$ are integrated of order at most 1. The appropriate way to
model the series $Y_t$ depends on the rank of the $m \times m$ matrix $\Pi$ in the VECM
(7.39). If the variables are jointly stationary, then the matrix $\Pi$ has full rank
$m$. In this case the series do not contain stochastic trends and one can
estimate a VAR model. If the matrix $\Pi$ has rank $r = 0$, so that $\Pi = 0$, then
the series contain $m$ stochastic trends and the variables are not cointegrated.
One should estimate a VAR model for the differenced variables $\Delta Y_t$. If the
matrix $\Pi$ has rank $1 \leq r \leq m-1$, then the variables are cointegrated. There
exist $r$ linearly independent cointegration relations and $(m - r)$ common
stochastic trends. One should estimate the VECM (7.39) with the restriction
that the matrix $\Pi$ has rank $r$.
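The three cases can be expressed as a small decision rule on the rank of $\Pi = -\Phi(1)$. The sketch below assumes NumPy and known parameter matrices, and it presumes, as in the summary above, that all variables are integrated of order at most 1; with estimated parameters the rank has to be determined by a statistical test rather than by a numerical rank computation. The function name and the returned labels are hypothetical.

```python
import numpy as np

def classify_by_rank(Phis, tol=1e-8):
    """Return (r, number of stochastic trends, suggested model) from Pi = -Phi(1)."""
    m = Phis[0].shape[0]
    Pi = -(np.eye(m) - sum(Phis))
    r = int(np.linalg.matrix_rank(Pi, tol=tol))
    if r == m:
        advice = "VAR in levels"
    elif r == 0:
        advice = "VAR in first differences"
    else:
        advice = "VECM with rank restriction"
    return r, m - r, advice

# Hypothetical VAR(1) examples, for illustration only.
print(classify_by_rank([0.5 * np.eye(2)]))                        # stationary case
print(classify_by_rank([np.eye(2)]))                              # two random walks
print(classify_by_rank([np.array([[0.6, 0.8], [0.1, 0.8]])]))     # one cointegration relation
```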

Estimation  of  VAR  model  with  cointegration 

For a given rank $1 \leq r \leq m-1$ of the matrix $\Pi$, the parameters of the
VECM in (7.39) can be estimated by ML. This is in principle similar to
the estimation of stationary VAR models discussed in the foregoing section,
but of course one should incorporate the rank restriction on $\Pi$. The
corresponding log-likelihood can be maximized by so-called reduced rank
regressions. It is beyond the scope of this book to treat the details of this
optimization, but it is useful to know that no numerical optimization is
needed and that the maximum can be obtained by means of regressions.
The maximal value of the log-likelihood, for given rank $r$ of the matrix $\Pi$,
is equal to

$$\log(L_{\max}(r)) = c - \frac{n-p}{2}\sum_{j=1}^{r}\log(1 - \hat{\lambda}_j).$$

Here $c$ is a constant that does not depend on the chosen rank $r$, and the
(eigenvalues) $\hat{\lambda}_j$ are ordered so that $1 > \hat{\lambda}_1 \geq \hat{\lambda}_2 \geq \cdots \geq \hat{\lambda}_m \geq 0$.
The square roots $\hat{\lambda}_j^{1/2}$ of these values are the so-called (sample partial)
canonical correlation coefficients of the two $m \times 1$ vectors $\Delta Y_t$ and $Y_{t-1}$.


Interpretation of the eigenvalues $\lambda_j$

The eigenvalues $\lambda_j$ in the above log-likelihood for the VECM have the following
intuitive interpretation. The series are cointegrated if there exist linear
combinations $\beta' Y_{t-1}$ that are stationary. This is expressed in the VECM (7.39) by the
condition that $\Pi \neq 0$, that is, a non-zero (partial) correlation of $Y_{t-1}$ with the
stationary variables $\Delta Y_t$ (for given values of $\Delta Y_{t-j}$, $j = 1, \ldots, p - 1$). The number
of cointegration relations, that is, the rank of $\Pi$ in the VECM (7.39), is equal
to the number of non-zero correlations between linear combinations of $Y_{t-1}$ and
$\Delta Y_t$. The canonical correlations $\lambda_j^{1/2}$ measure these correlations. The number of
stochastic trends is equal to the number of zero canonical correlations. This means
that in practice the number $r$ of cointegration relations and the number $(m - r)$ of
(common) stochastic trends can be determined by checking how many of the
canonical correlations differ significantly from zero. That is,

$r$ = (number of significant $\lambda_j$), $m - r$ = (number of $\lambda_j \approx 0$).

Note that the ADF test regression (7.25) for univariate time series resembles the
VECM (7.39). The null hypothesis of a unit root in (7.25) (that is, the presence of
a stochastic trend) corresponds to $\rho = 0$. In this case $\Delta y_t$ and $y_{t-1}$ are uncorrelated
(for given values of $\Delta y_{t-j}$, $j = 1, \ldots, p - 1$). In a similar way, the null hypothesis
of no cointegration (that is, the presence of $m$ stochastic trends) in the VECM
(7.39) corresponds to $\Pi = 0$. In this case the vectors $\Delta Y_t$ and $Y_{t-1}$ are uncorrelated
(for given values of $\Delta Y_{t-j}$, $j = 1, \ldots, p - 1$). This corresponds to the null
hypothesis that all the $m$ eigenvalues $\lambda_j$ are zero.


LR-test on the number of cointegration relations

The above results can be used for likelihood ratio tests on the number of
cointegration relations. Because the constant $c$ in the above expression for
the log-likelihood does not depend on the postulated rank, the LR-test for the
null hypothesis that rank($\Pi$) $= r$ against the alternative that rank($\Pi$) $\ge r + 1$
is given by

$$\mathrm{LR}(r) = 2\bigl(\log(L_{\max}(m)) - \log(L_{\max}(r))\bigr) = -(n - p) \sum_{j=r+1}^{m} \log(1 - \lambda_j). \qquad (7.40)$$

This is called the Johansen trace test on the number of cointegration
relations. The null hypothesis is not rejected if the LR-test gives sufficiently small
values, that is, if the values of $\lambda_j$ for $j = r + 1, \ldots, m$ are sufficiently close to
zero. In this case the last $(m - r)$ eigenvalues are not significant, so that there
are at most $r$ significant eigenvalues and hence at most $r$ cointegration relations.
The trace test can be used as follows to determine the number $r$ of cointegration
relations.
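The trace statistic (7.40) is simple to compute once the eigenvalues are available. The sketch below assumes that the eigenvalues have already been obtained (for instance from software output); the function name, the argument `n_eff` (standing in for the factor n - p) and the example numbers are hypothetical.

```python
import math

def trace_statistics(eigenvalues, n_eff):
    """Return [LR(0), LR(1), ..., LR(m-1)] as in (7.40).

    eigenvalues: the m eigenvalues, each in the interval [0, 1)
    n_eff:       the scale factor (n - p) of the test statistic
    """
    lam = sorted(eigenvalues, reverse=True)  # lambda_1 >= ... >= lambda_m
    stats = []
    for r in range(len(lam)):
        # LR(r) sums over the (m - r) smallest eigenvalues, j = r+1, ..., m
        stats.append(-n_eff * sum(math.log(1.0 - value) for value in lam[r:]))
    return stats

# Hypothetical example with m = 2 variables
example_stats = trace_statistics([0.10, 0.02], n_eff=200)
print(example_stats)
```

Because every term in the sum is non-negative, LR(r) decreases as r increases, and it is close to zero when the remaining eigenvalues are close to zero.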


Testing  for  the  number  of  cointegration  relations 

•  Step 1: Test $H_0: r = 0$ against $H_1: r \ge 1$. First test the null hypothesis that
there is no cointegration and that there are $m$ stochastic trends. This
corresponds to the hypothesis that $\lambda_1 = \cdots = \lambda_m = 0$, and the relevant
test statistic is (7.40) with $r = 0$. If $H_0$ is not rejected, then there is no
cointegration. If $H_0$ is rejected, continue with step 2.

•  Step 2: Test $H_0: r = 1$ against $H_1: r \ge 2$. In a similar way as in step 1, apply
the test (7.40) with $r = 1$. If $H_0$ is not rejected then there is a single
cointegration relation and there are $(m - 1)$ common trends. If $H_0$ is
rejected, continue with step 3.

•  Step 3: Iteratively test $H_0$: rank($\Pi$) $= r$ against $H_1$: rank($\Pi$) $\ge r + 1$.
Repeat the test (7.40) iteratively, increasing the value of $r$ by one in each
step. Continue until the first time that $H_0$ is not rejected. Then the number
of cointegration relations is equal to $r$ and the number of (common) trends
is $(m - r)$.
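The three steps above can be sketched as a short loop. The function name is an assumption, the trace statistics in the example are hypothetical, and the two critical values are the 5% values for data without a clear trend reported in Exhibit 7.31.

```python
def select_rank(trace_stats, critical_values):
    """Sequential trace testing.

    trace_stats[r]     : LR(r) of (7.40) for r = 0, ..., m-1
    critical_values[r] : critical value for testing rank r
                         (that is, for m - r stochastic trends)
    Returns the first r for which H0: rank = r is not rejected,
    or m if every null hypothesis is rejected.
    """
    m = len(trace_stats)
    for r in range(m):
        if trace_stats[r] <= critical_values[r]:
            return r  # first non-rejection: r cointegration relations
    return m          # all hypotheses rejected

# Hypothetical statistics for m = 2 variables; the critical values are for
# m - r = 2 and m - r = 1 (no clear trend in the data).
chosen_rank = select_rank([30.0, 5.0], [19.96, 9.24])
print(chosen_rank)  # 1
```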


The above LR-tests do not have the usual $\chi^2$-distribution. This is because the
regressors $Y_{t-1}$ in the test equations (7.39) contain stochastic trends under
the null hypothesis. The (asymptotic) distribution depends on the number of
variables $m$ and on the cointegration rank $r$, and also on the presence of
deterministic components (such as constants and deterministic trends) in the
VECM test equations.

Below we will discuss three variants of the test equations that are
much used in practice. These three variants are based on considerations
that are similar to those for the three types of unit root tests discussed in
Section 7.3.3. Some critical values for the three types of cointegration tests
(7.40) are given in Exhibit 7.31. The critical values are based on
the assumption that the VECM is correctly specified and that the error
terms are normally distributed. Under these assumptions, the critical values
depend on the type of cointegration test and on the number $(m - r)$ of
stochastic trends, but they do not depend on the order $p$ of the VECM.

Data properties and VECM assumptions               m-r=1   m-r=2   m-r=3   m-r=4   m-r=5
Linear trend in data; constant and trend in CE     12.25   25.32   42.44   62.99   87.31
Linear trend in data; constant, no trend in CE      3.76   15.41   29.68   47.21   68.52
No clear trend in data; constant, no trend in CE    9.24   19.96   34.91   53.12   76.07

CE denotes the Cointegration Equations.
m is the number of variables and r is the number of cointegration relations.
m - r is the number of unit roots, that is, the number of common stochastic trends for the
m variables.

Exhibit 7.31 Cointegration tests

The 5% critical values for the Johansen trace test on the number of cointegration relations (r)
for three types of DGP: one for data with a clear trend direction and with a trend in
the cointegration relations, another for data with a clear trend direction but without a trend in
the cointegration relations, and finally one for data without a clear trend direction (and no
trend in the cointegration relations). For trending data one should start by including constant
and trend in CE (this trend term could be dropped if it is not significant).
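For use in calculations, the critical values of Exhibit 7.31 can be stored in a small lookup table. The sketch below copies the 5% values from the exhibit; the dictionary keys and the function name are illustrative assumptions.

```python
# 5% critical values of the Johansen trace test, copied from Exhibit 7.31.
# Keys are (trend in data, trend in cointegration equations); each list is
# indexed by the number of stochastic trends m - r = 1, ..., 5.
CRITICAL_5PCT = {
    ("trend", "trend"): [12.25, 25.32, 42.44, 62.99, 87.31],
    ("trend", "no trend"): [3.76, 15.41, 29.68, 47.21, 68.52],
    ("no trend", "no trend"): [9.24, 19.96, 34.91, 53.12, 76.07],
}

def critical_value(data_trend, ce_trend, m, r):
    """5% critical value for testing rank r with m variables."""
    n_trends = m - r
    if not 1 <= n_trends <= 5:
        raise ValueError("the exhibit covers only m - r = 1, ..., 5")
    return CRITICAL_5PCT[(data_trend, ce_trend)][n_trends - 1]

# Two variables without a clear trend, testing r = 0 (so m - r = 2)
print(critical_value("no trend", "no trend", m=2, r=0))  # 19.96
```

Note that the lookup depends only on m - r and on the deterministic terms, in line with the remark that the critical values do not depend on the order p.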

Test  equations  for  data  with  clear  overall  trend  direction 

The  Johansen  trace  test  (7.40)  is  based  on  the  VECM  (7.39),  possibly 
extended  with  deterministic  trend  terms.  These  trend  terms  and  the  constant 
terms  require  special  attention,  because  the  relevant  critical  values  of  the 
test  (7.40)  depend  on  the  precise  specifications  of  these  deterministic 
components. 

As a first case we assume that the data display a clear general trend
direction. For such data we should include constant terms in the VECM
and a deterministic trend in the cointegration equations. It is useful to rewrite
the resulting VECM in another form. If the coefficient matrix $\Pi$ of $Y_{t-1}$ in
(7.39) has rank $r$, it can be written as $\Pi = AB'$, where $A$ and $B$ are $m \times r$
matrices of rank $r$. Further we decompose the $m \times 1$ vector of constants $\gamma$ in the VECM as
$\gamma = \gamma_1 - A\gamma_2$, where $\gamma_2 = -(A'A)^{-1}A'\gamma$ and $\gamma_1 = (I - A(A'A)^{-1}A')\gamma$, so that
the two components $\gamma_1$ and $A\gamma_2$ are orthogonal as $\gamma_1' A \gamma_2 = 0$ (this
corresponds to the OLS decomposition $y = Xb + e = Hy + My$ of Section 3.1.3
(p. 123), replacing $y$ by $\gamma$, $X$ by $A$, $b$ by $-\gamma_2$, and $e$ by $\gamma_1$). With this notation,
$\Pi = AB'$ and $\gamma = \gamma_1 - A\gamma_2$, the relevant test equation for cointegration of
trending data can be written as

$$\Delta Y_t = \gamma_1 + A(B'Y_{t-1} - \gamma_2 - \delta t) + \Gamma_1 \Delta Y_{t-1} + \cdots + \Gamma_{p-1} \Delta Y_{t-p+1} + \varepsilon_t.$$

The constants $\gamma_1$ are drift terms for the stochastic trends in the variables $Y_t$.
The cointegration relations or long-run equilibria are described by the $r$
equations $B'Y_{t-1} - \gamma_2 - \delta t = 0$. One usually expresses this by saying that
there is a linear trend in the data (if $\gamma_1 \neq 0$) and a constant and linear trend in
the cointegration equations (if $\gamma_2 \neq 0$ and $\delta \neq 0$). The LR cointegration test
for data with clear trends is based on the above test equations, with trends in
the data and in the cointegration equations.
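The decomposition of the constant vector into a drift part and a part inside the cointegration relations can be illustrated numerically. The sketch below treats the simplest case r = 1, so that A is a single column; the names `gamma`, `gamma_1`, `gamma_2` and `a`, and all numbers, are hypothetical.

```python
def decompose_constant(gamma, a):
    """Split the constant vector as gamma = gamma_1 - a * gamma_2 for r = 1.

    gamma_2 = -(a'a)^{-1} a'gamma            (a scalar when r = 1)
    gamma_1 = (I - a (a'a)^{-1} a') gamma    (orthogonal to a)
    """
    a_a = sum(x * x for x in a)                  # a'a
    a_g = sum(x * g for x, g in zip(a, gamma))   # a'gamma
    gamma_2 = -a_g / a_a
    gamma_1 = [g - x * a_g / a_a for g, x in zip(gamma, a)]
    return gamma_1, gamma_2

gamma = [0.5, -0.2]   # hypothetical constants of a VECM with m = 2
a = [-0.02, 0.03]     # hypothetical adjustment coefficients (m x 1)

gamma_1, gamma_2 = decompose_constant(gamma, a)
rebuilt = [g1 - x * gamma_2 for g1, x in zip(gamma_1, a)]   # gamma_1 - a * gamma_2
inner = sum(g1 * x for g1, x in zip(gamma_1, a))            # gamma_1' a

print(rebuilt)  # reproduces gamma up to rounding error
print(inner)    # zero up to rounding error
```

This is the same projection argument as in OLS: the part of the constant vector that lies in the direction of the adjustment coefficients is absorbed into the cointegration relation, and the remainder acts as drift.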

Model  for  cointegration  relations  without  trend 

For the economic interpretation of equilibria it is sometimes relevant to
consider the restricted model with $\delta = 0$, because in this case the linear
combinations $B'Y_t$ move around a constant equilibrium value $\gamma_2$ in the long
run. The corresponding VECM is given by

$$\Delta Y_t = \gamma_1 + A(B'Y_{t-1} - \gamma_2) + \Gamma_1 \Delta Y_{t-1} + \cdots + \Gamma_{p-1} \Delta Y_{t-p+1} + \varepsilon_t.$$

In this case one says there is a linear trend in the data and a constant but no
trend in the cointegration relation. The best approach is to determine the
cointegration rank $r$ first by means of the VECM with deterministic trend
included in the cointegration relations, as it is worse to omit relevant terms (if
$\delta \neq 0$) than to include irrelevant ones (if $\delta = 0$). One can then test for the
significance of the trend coefficients $\delta$. If these are not significant, one can
redo the tests and estimate a VECM without trend in the cointegration
relations.

Test  equations  for  data  without  clear  trend  direction 

If the variables display no clear trend direction, then the drift terms $\gamma_1$ can
also be removed from the model. So the relevant test VECM for non-trended
data becomes

$$\Delta Y_t = A(B'Y_{t-1} - \gamma_2) + \Gamma_1 \Delta Y_{t-1} + \cdots + \Gamma_{p-1} \Delta Y_{t-p+1} + \varepsilon_t.$$

One says that there is no trend in the data and no trend in the cointegration
relation.

Overview  of  modelling  of  multiple  time  series  with  trends 

We summarize the steps needed to model a set of trended variables. Here we
assume that the variables are either integrated of order 0 (so that $Y_t$ is trend
stationary) or integrated of order 1 (so that $\Delta Y_t$ is stationary).

•  Step 1: Test for the nature of the trends in the time series. Test whether the
trend of the variables is deterministic or stochastic, by the methods
discussed in Section 7.3.3. If the null hypothesis of stochastic trends is rejected
and the trends are deterministic, then estimate a VAR model with
deterministic trend terms included as regressors. If the trends are stochastic,
then continue with step 2.

•  Step 2: Test for the presence of cointegration. If the variables contain
stochastic trends, then test for the presence of cointegration by means of
the Johansen trace test. Choose the relevant VECM test equation (with or
without constant terms and deterministic trends), starting with constants
and trends included. If the null hypothesis of no cointegration is rejected,
then continue with step 3, and if this hypothesis is not rejected, continue
with step 4.

•  Step 3: Estimation of VECM with cointegration. If the series are
cointegrated, then determine the number $r$ of cointegration relations by applying
the Johansen trace test iteratively until the null hypothesis of $r$ relations
against the alternative of at least $(r + 1)$ relations is not rejected. Estimate
the corresponding VECM, that is, (7.39) where $\Pi$ has rank $r$ and with
relevant constants and trend terms included.

•  Step 4: Estimation of VAR for $\Delta Y_t$ in absence of cointegration. If the series
are not cointegrated, then take first differences of the data and estimate a
VAR model for the stationary variables $\Delta Y_t$.

It is also possible to combine the tests in the above four steps, as follows.
Perform the Johansen trace test iteratively, starting with $r = 0$ and increasing
$r$ until the null hypothesis is not rejected anymore. If $r = 0$ is not rejected,
then the series are not cointegrated, so continue with step 4. If $r = m$ is not
rejected, then the $m$ series are jointly (trend) stationary, so continue with step
1 (VAR model). If $1 \le r \le m - 1$, then the series are cointegrated and one
continues with step 3 (VECM).

Example  7.27:  Interest  and  Bond  Rates  (continued) 

We continue our analysis of the monthly series of the AAA bond rate,
denoted by $y_t$, and the three-month Treasury Bill rate, denoted by $x_t$. In
Example 7.26 in the foregoing section we estimated a VAR model for the
differenced series $\Delta y_t$ and $\Delta x_t$, that is, we performed step 4 of the above
approach without testing for the possible presence of cointegration. Now
we investigate whether these two series are cointegrated. We follow the
above steps and discuss (i) the nature of the trends in the two time series,
(ii) the outcomes of cointegration tests, (iii) the results of a VECM with
one cointegration relation, and (iv) an interpretation of the cointegration
relation.

(i) Nature of the trends in both series

In step 1 we test for the presence and nature of trends in the data. The graphs
of the levels of the two rates were given in Example 7.25 (see Exhibit 7.29
(a)). These indicate that the series are not stationary and that they do not have
a clear long-term trend direction. The results of ADF t-tests are in Panels
1 and 2 of Exhibit 7.32, both for the test equation with deterministic trend
included and for the test equation without deterministic trend term. Neither
of the two tests can reject the presence of a unit root for the two series. The
ADF tests on the series of differences $\Delta y_t$ and $\Delta x_t$ show that these series do
not contain a unit root. We conclude that the series $y_t$ and $x_t$ are integrated
of order 1.


(ii)  Cointegration  tests 

In step 2 we apply the Johansen test on cointegration. We test this in two
models. First we use the general VECM with linear trend in the data and
deterministic trend in the cointegration relation. Next we consider the VECM
without trend in the data and in the cointegration relation. The results are in
Panels 3 and 4 of Exhibit 7.32. For the VECM with trends, the first eigenvalue
$\lambda_1 = 0.061$ differs significantly from zero but the second one, $\lambda_2 = 0.007$, does not.
For the VECM without trends the first eigenvalue $\lambda_1 = 0.058$ is significant but
the second one, $\lambda_2 = 0.006$, is not. We conclude that the two series are
cointegrated and that there exists one common stochastic trend.


(iii)  VECM  with  one  cointegration  relation 

In step 3 we estimate the VECM with one cointegration relation. Panels 5
and 6 of Exhibit 7.32 show the estimates for two VECMs, with and without
trend terms. The trend in the cointegration relation is not significant
($t = -1.52$ in Panel 5) and the drift terms are also not significant (the t-values
are 0.90 for the AAA equation and 0.23 for the Treasury Bill equation, see
Panel 5). This motivates the use of the VECM without drift terms and with a
constant (but no trend) in the cointegration relation. Panel 6 of Exhibit 7.32
shows the resulting estimated model:

$$
\begin{pmatrix} \Delta y_t \\ \Delta x_t \end{pmatrix}
= \begin{pmatrix} -0.0188 \\ 0.0338 \end{pmatrix} (y_{t-1} - 1.15 x_{t-1} - 1.28)
+ \begin{pmatrix} 0.5137 & -0.0437 \\ 0.7780 & 0.1699 \end{pmatrix}
\begin{pmatrix} \Delta y_{t-1} \\ \Delta x_{t-1} \end{pmatrix}
+ \begin{pmatrix} -0.3333 & 0.0364 \\ -0.5175 & -0.0324 \end{pmatrix}
\begin{pmatrix} \Delta y_{t-2} \\ \Delta x_{t-2} \end{pmatrix}
+ \begin{pmatrix} e_{1t} \\ e_{2t} \end{pmatrix}.
$$

In this model, the long-run equilibrium relation is estimated as $y_t - 1.15x_t
- 1.28 = 0$ (where $y_t$ and $x_t$ are both measured in percentages), or

$$y_t = 1.15x_t + 1.28.$$
(a) Panel 1: ADF t-tests for AAA and for Δ(AAA)

Series (test equation)           t-statistic   1% Critical Value   5% Critical Value   Lags included
AAA (const and trend)             -1.266009        -3.9779             -3.4194               6
AAA (const but no trend)          -1.615880        -3.4437             -2.8667               6
Δ(AAA) (const but no trend)       -8.770391        -3.4437             -2.8667               6

(b) Panel 2: ADF t-tests for US3MT and for Δ(US3MT)

Series (test equation)           t-statistic   1% Critical Value   5% Critical Value   Lags included
US3MT (const and trend)           -1.974514        -3.9779             -3.4194               6
US3MT (const but no trend)        -2.053994        -3.4437             -2.8667               6
Δ(US3MT) (const but no trend)     -11.97861        -3.4437             -2.8667               6

Panel 3: Johansen test on cointegration

Test assumption: Trend in the data, trend and constant in coint relation
Series: AAA US3MT
Sample: 1950:01 1999:12; Included observations: 600; Included lags: 4

Eigenvalue (λ)     Likelihood Ratio test   5 Percent Critical Value   1 Percent Critical Value   Hypothesized No. of CE(s)
0.061070 (λ1)      41.82298                25.32                      30.45                      None (r = 0)
0.006668 (λ2)      4.014194                12.25                      16.26                      At most 1 (r ≤ 1)

LR test indicates 1 cointegrating equation at 5% and at 1% significance level


(d) Panel 4: Johansen test on cointegration

Test assumption: No trend in the data, constant but no trend in coint relation
Series: AAA US3MT
Sample: 1950:01 1999:12; Included observations: 600; Included lags: 4

Eigenvalue (λ)     Likelihood Ratio test   5 Percent Critical Value   1 Percent Critical Value   Hypothesized No. of CE(s)
0.058389 (λ1)      39.90205                19.96                      24.60                      None (r = 0)
0.006320 (λ2)      3.804047                9.24                       12.97                      At most 1 (r ≤ 1)

LR test indicates 1 cointegrating equation at 5% and at 1% significance level


Exhibit  7.32  Interest  and  Bond  Rates  (Example  7.27) 

Unit  root  tests  for  the  series  of  the  AAA  bond  rate  (Panel  1)  and  for  the  three-month  Treasury 
Bill  rate  (Panel  2),  and  Johansen  cointegration  tests  (with  trend  in  data  and  with  constant  and 
trend  in  cointegration  relation  in  Panel  3,  and  without  trends  in  Panel  4). 


This means that, in equilibrium, the AAA bond rate is higher than the
three-month Treasury Bill rate. For instance, if $x_t = 5$ per cent, then in equilibrium
$y_t = 7.03$ per cent. The adjustment coefficients are $-0.019$ for the AAA bond
rate and $0.034$ for the Treasury Bill rate. For instance, if the two rates are out
of equilibrium in the sense that $y_t > 1.15x_t + 1.28$, then this leads to a
downward adjustment of $y_t$ (of 1.9 per cent of the difference) and an upward
adjustment of $x_t$ (of 3.4 per cent of the difference). This means that the rates
are adjusted in the direction of equilibrium. The adjustment is rather slow, as
Panel 5: VECM with trends and with 1 cointegration relation

Sample: 1950:01 1999:12 (600 observations)

Cointegrating Eq:      Coefficient    Std. Error    t-Statistic
AAA(-1)                  1.000000
US3MTBIL(-1)            -1.051707      0.09014      (-11.6670)
@TREND(1950:01)         -0.002080      0.00137      (-1.51513)
C                       -0.234493

Error Correction       Equation for                 Equation for
Model                  D(AAA)         t-Statistic   D(US3MT)       t-Statistic
Coint. Equation        -0.022650      (-3.43266)     0.039050      (2.62909)
D(AAA(-1))              0.513984      (10.7490)      0.773596      (7.18723)
D(AAA(-2))             -0.331826      (-6.75667)    -0.523998      (-4.74004)
D(US3MTBIL(-1))        -0.044806      (-2.11576)     0.172205      (3.61252)
D(US3MTBIL(-2))         0.034777      (1.62788)     -0.029947      (-0.62275)
C                       0.007015      (0.89989)      0.004101      (0.23373)
R-squared               0.222181                     0.197649

Log Likelihood                   -79.22656
Akaike Information Criterion       0.314089
Schwarz Criterion                  0.424012

Panel 6: VECM without trends and with 1 cointegration relation

Sample: 1950:01 1999:12 (600 observations)

Cointegrating Eq:      Coefficient    Std. Error    t-Statistic
AAA(-1)                  1.000000
US3MT(-1)               -1.150871      0.08209      (-14.0192)
C                       -1.275931      0.46790      (-2.72690)

Error Correction       Equation for                 Equation for
Model                  D(AAA)         t-Statistic   D(US3MT)       t-Statistic
Coint. Equation        -0.018831      (-3.32220)     0.033829      (2.65473)
D(AAA(-1))              0.513660      (10.7291)      0.777986      (7.22849)
D(AAA(-2))             -0.333304      (-6.77534)    -0.517504      (-4.67940)
D(US3MT(-1))           -0.043715      (-2.06522)     0.169932      (3.57109)
D(US3MT(-2))            0.036440      (1.71060)     -0.032351      (-0.67553)
R-squared               0.220254                     0.197719

Log Likelihood                   -80.26316
Akaike Information Criterion       0.310877
Schwarz Criterion                  0.406144

(g) Panel 7: ADF Test

Variable: AAA - US3MT
constant but no trend, 6 lags included

t = -3.264717
5% Critical Value: -2.8667

(h) [Time plot of the spread AAA - US3MT]

Exhibit 7.32 (Contd.)

VECM for the series of the AAA bond rate and the three-month Treasury Bill rate (model with
trends in Panel 5 and model without trends in Panel 6), unit root test on the difference of the
two rates (AAA - US3MT, Panel 7) and time plot of this difference (h).
only around 5.3 per cent of the gap from equilibrium is adjusted within the
period of one month.
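The arithmetic behind these statements can be retraced with a few lines of code. The sketch below uses the rounded coefficients quoted in the text (1.15, 1.28, -0.019 and 0.034) and looks only at the error-correction term, ignoring the lagged differences in the VECM; the function names are hypothetical.

```python
SLOPE, CONSTANT = 1.15, 1.28          # cointegration relation y = 1.15 x + 1.28
ADJ_AAA, ADJ_TBILL = -0.019, 0.034    # adjustment coefficients

def equilibrium_aaa(tbill):
    """Equilibrium AAA rate (in per cent) for a given Treasury Bill rate."""
    return SLOPE * tbill + CONSTANT

def error_correction_step(aaa, tbill):
    """One-month changes implied by the error-correction term alone."""
    gap = aaa - equilibrium_aaa(tbill)
    return ADJ_AAA * gap, ADJ_TBILL * gap

print(round(equilibrium_aaa(5.0), 2))                # 7.03

# AAA rate one percentage point above its equilibrium value
d_aaa, d_tbill = error_correction_step(8.03, 5.0)
print(round(d_aaa, 3), round(d_tbill, 3))            # -0.019 0.034
print(round(abs(d_aaa) + abs(d_tbill), 3))           # 0.053

# Change in the gap y - 1.15 x - 1.28 itself after this step
gap_reduction = -(d_aaa - SLOPE * d_tbill)
print(round(gap_reduction, 3))                       # 0.058
```

The figure of 5.3 per cent is the sum of the two adjustment coefficients. Measured in the units of the cointegration relation, the gap shrinks by 0.019 + 1.15 × 0.034, or about 5.8 per cent per month, which leads to the same conclusion that adjustment is slow.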

(iv)  Interpretation  of  the  cointegration  relation 

The slope coefficient of 1.15 in the cointegration relation is quite close to 1,
which suggests that the spread $(y_t - x_t)$ between the two rates may be
stationary. This can be tested by an ADF test on this series. The results in Panels 7
and 8 of Exhibit 7.32 show that the spread $(y_t - x_t)$ is indeed stationary (at
5 per cent significance level). This equilibrium relation has a clear financial
interpretation. In the long run the difference between the two rates fluctuates
around a constant level. This means that the additional risk premium of AAA bonds, as
compared to that of Treasury Bills, remains unaltered in the long run.
Example 7.28: Treasury Bill Rates

As  a  second  example  we  consider  three  series  that  we  expect  to  be  linked 
together  in  the  long  run  —  namely,  Treasury  Bill  rates  for  three  different 
maturities.  We  will  discuss  (i)  the  data,  (ii)  a  test  for  the  number  of  trends, 
(iii)  a  VECM  with  two  cointegration  relations  (that  is,  with  one  common 
trend  for  the  three  series),  and  (iv)  an  interpretation  of  the  two  cointegration 
relations. 

(i)  The  data 

The  three  considered  time  series  are  the  three-month,  one-year,  and  ten-year 
Treasury  Bill  rates  in  the  USA,  with  monthly  observations  over  the  years 
1960-99.  The  data  are  measured  in  percentages  and  are  taken  from  the 
Federal  Reserve  Board  of  Governors.  The  time  plot  of  the  three  series  is  in 
Exhibit  7.33  (a).  This  graph  suggests  that  the  three  series  do  not  have  a  clear 
trend  direction  and  that  they  possibly  follow  a  random  walk  with  one 
common  stochastic  trend. 

(ii)  Test  for  the  number  of  trends 

We test for the number of cointegration relations by applying the Johansen
trace test. It turns out that the drift terms and the trend in the cointegration
relation can be omitted. Therefore we present the results for the VECM
without trends in Panel 2 of Exhibit 7.33. The test gives eigenvalues
$\lambda_1 = 0.078$, $\lambda_2 = 0.041$, and $\lambda_3 = 0.008$. The first two eigenvalues differ
significantly from zero (at 5 per cent significance), but the third one does
not. This means that the matrix $\Pi$ in the VECM for these three series has
rank $r = 2$. So there are two cointegration relations between the variables
and there is one common stochastic trend that drives all three interest rates.
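The reported test outcomes can be retraced from the eigenvalues. The sketch below uses the eigenvalues and 5% critical values of Panel 2 of Exhibit 7.33 and takes the number of included observations (475) as the scale factor of the trace statistic; this last choice is an assumption about how the software computes the statistic, and the variable names are illustrative.

```python
import math

eigenvalues = [0.077817, 0.041226, 0.008470]   # Panel 2 of Exhibit 7.33
n_eff = 475                                    # "Included observations" (assumed scale factor)
critical_5pct = [34.91, 19.96, 9.24]           # for r = 0, 1, 2 (m - r = 3, 2, 1)

# Trace statistics LR(r) = -n_eff * sum_{j > r} log(1 - lambda_j)
trace = [-n_eff * sum(math.log(1.0 - lam) for lam in eigenvalues[r:])
         for r in range(len(eigenvalues))]

# Sequential testing: stop at the first rank that is not rejected
rank = len(eigenvalues)
for r, (stat, crit) in enumerate(zip(trace, critical_5pct)):
    if stat <= crit:
        rank = r
        break

print([round(s, 2) for s in trace])  # [62.52, 24.04, 4.04]
print(rank)                          # 2
```

The computed statistics agree with the likelihood ratio column of Panel 2 up to rounding of the eigenvalues, and the sequential procedure stops at r = 2.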
(a) [Time plot of the three Treasury Bill rates]

(b) Panel 2: Johansen test on cointegration

Test assumption: No deterministic trend in the data
Series: T_3M, T_1Y, T_10Y
Sample: 1960:01 1999:12; Included observations: 475; Included lags: 4

Eigenvalue (λ)     Likelihood Ratio test   5 Percent Critical Value   1 Percent Critical Value   Hypothesized No. of CE(s)
0.077817 (λ1)      62.51845                34.91                      41.07                      None (r = 0)
0.041226 (λ2)      24.03773                19.96                      24.60                      At most 1 (r ≤ 1)
0.008470 (λ3)      4.040461                9.24                       12.97                      At most 2 (r ≤ 2)

LR test indicates 2 cointegrating equations at 5% significance level

(c) Panel 3: VECM without trends and with 2 cointegration relations

Sample(adjusted): 1960:04 1999:12 (477 observations)

Cointegrating     Coefficients                              Coefficients
Equation:         Coint Eq 1    Std. Error   t-Statistic    Coint Eq 2    Std. Error   t-Statistic
T_3M(-1)           1.000000                                  0.000000
T_1Y(-1)           0.000000                                  1.000000
T_10Y(-1)         -0.882970     0.08490      -10.3999       -0.972623     0.07585      -12.8233
C                  0.574733     0.65977       0.87111        0.595891     0.58941       1.01099

Error Correction  Equation for                Equation for                Equation for
Model             D(T_3M)       t-Statistic   D(T_1Y)       t-Statistic   D(T_10Y)      t-Statistic
CointEquation 1   -0.138592     -2.07326       0.052581      0.79810       0.116110      2.82168
CointEquation 2    0.124560      1.61478      -0.092694     -1.21928      -0.115427     -2.43089
D(T_3M(-1))        0.048494      0.41095      -0.114026     -0.98042      -0.082522     -1.13602
D(T_3M(-2))        0.011969      0.10133       0.170775      1.46689       0.136141      1.87226
D(T_1Y(-1))        0.362771      2.21719       0.401741      2.49133       0.070020      0.69520
D(T_1Y(-2))       -0.143895     -0.86784      -0.291773     -1.78547      -0.134795     -1.32065
D(T_10Y(-1))       0.156030      1.08619       0.322656      2.27904       0.398285      4.50414
D(T_10Y(-2))      -0.221795     -1.53246      -0.185207     -1.29841      -0.188165     -2.11202
R-squared          0.219135                    0.214765                    0.190565

Exhibit  7.33  Treasury  Bill  Rates  (Example  7.28) 

Time  plot  of  three  US  interest  rates  (three  months,  one  year,  and  ten  years  (a)),  Johansen 
cointegration  test  (Panel  2),  and  VECM  (Panel  3). 
(d), (e), (f) [Time plots of the three interest spreads]

Panel 7: ADF t-tests for three interest spreads

Constant, but no trend in test equation, and 4 lags included

Spread                  ADF t-Statistic   5% Critical Value
r_10y - r_1y            -3.259773         -2.8679
r_10y - r_3m            -3.657539         -2.8679
r_1y - r_3m             -4.983038         -2.8679

Exhibit 7.33 (Contd.)

Time plot of three interest spreads ((d)-(f)) and unit root tests (Panel 7).


(iii)  VECM  with  two  cointegration  relations 

Panel 3 of Exhibit 7.33 shows the estimated VECM with two cointegration
relations. The long-run equilibrium relations are estimated as

$$r_{3m} = 0.88\, r_{10y} - 0.57, \qquad r_{1y} = 0.97\, r_{10y} - 0.60.$$

Here the index of the Treasury Bill rate $r$ denotes the maturity, three months
(3m), one year (1y), or ten years (10y). The outcomes are in line with
financial theory, since in equilibrium the interest rates should be higher for
longer maturities.

(iv)  Interpretation  of  the  two  cointegration  relations 

The slope coefficients of the cointegration equations are quite close to one.
Therefore we test whether the three series of interest spreads $(r_{10y} - r_{1y})$,
$(r_{10y} - r_{3m})$, and $(r_{1y} - r_{3m})$ are stationary. We apply ADF tests, with a
constant term but without deterministic trend in the test equation. The
outcomes in Panel 7 of Exhibit 7.33 show that the null hypothesis of a unit
root can be rejected for all three spreads. So the spreads are stationary and
move up and down along a long-run equilibrium value. The graphs of the
spreads in Exhibit 7.33 (d-f) indeed indicate the existence of such equilibria
over long time horizons, although deviations may persist for considerable
time. For instance, note that in the early 1980s the spreads were negative so
that long-term interest rates were lower than short-term interest rates.
However, this disequilibrium situation has been corrected over time.


Exercises:  S:  7.15e;  E:  7.17h,  i,  7.20f,  7.21c,  7.23b,  c,  7.24b-d. 
7.6.4 Summary

In this section we have considered the joint modelling of a set of mutually
dependent time series variables.

•  If the variables are (trend) stationary, one can estimate a vector
autoregressive model by means of least squares. The estimated model can be
analysed by means of the usual diagnostic tests for regression models
and AR time series models. The model can be rewritten in error correction
form, which provides insight into the correction mechanisms between
the time series.

•  If  the  variables  contain  stochastic  trends,  then  one  should  investigate 
whether  the  variables  are  cointegrated.  The  appropriate  cointegration 
test  equation  for  the  Johansen  trace  test  depends  on  the  data  properties 
(clear  overall  trend  direction  or  not,  deterministic  trend  in  cointegration 
relations  or  not). 

•  If  the  series  are  not  cointegrated,  then  regressions  of  the  variables  in 
levels  may  lead  to  spurious  results.  Therefore  the  model  should  be 
estimated  only  after  the  stochastic  trends  have  been  removed  by  taking 
first  differences  of  the  data. 

• If the series are cointegrated, then one should estimate a VECM (with
reduced rank of the matrix $\Pi$ of the regressor $Y_{t-1}$). If there are m
variables and rank($\Pi$) = r, then there are r cointegration relations and
$(m - r)$ common trends. The estimated model has a clear interpretation
in terms of long-run equilibria and adjustment mechanisms that preserve
these equilibria.
7.7 Other multiple equation models

Uses Chapters 1-4; Sections 5.4, 5.5, 5.7; Sections 7.1-7.3, 7.6.


7.7.1  Introduction 

Combined  cross  section  and  time  series  data 

Many  data  sets  of  practical  interest  consist  of  a  large  number  of  variables  that 
are  observed  on  a  sequence  of  successive  moments  in  time.  Examples  are 
yearly  time  series  observations  of  the  production  of  a  large  number  of  firms, 
weekly purchases of a large number of households, and quarterly developments
in gross national product of a large number of countries. In such cases
the  data  concern  a  cross  section  of  units  (firms,  households,  countries),  and 
for  each  unit  a  time  series  of  observations  is  available.  If  the  information 
consists  of  such  a  combination  of  cross  section  and  time  series  data,  then  one 
says  that  the  data  are  pooled.  Such  data  are  also  called  panel  data,  where  the 
panel  refers  to  the  cross  section  (of  firms,  households,  countries,  and  so  on). 

The  vector  autoregressive  model  of  Section  7.6  is  an  example  of  a  model 
for  a  number  of  observed  time  series  variables.  However,  as  was  discussed  in 
Section  7.6.2,  this  model  is  not  suitable  for  a  large  number  of  time  series, 
because  the  number  of  parameters  in  the  VAR  model  grows  with  the  square 
of  the  number  of  time  series.  In  practice,  pooled  data  sets  often  contain  a 
large  number  of  units  (dozens  of  countries,  hundreds  of  firms,  thousands  of 
households). For instance, if the data set consists of fifty weekly observations
of 1000 households, then a VAR(1) model for these 1000 series has more
than a million parameters. Such data sets should be analysed in another way.
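
The order of magnitude mentioned above is easy to verify. The sketch below counts the regression coefficients of a VAR(p) model with a constant term in each equation. The function name is hypothetical, and the covariance parameters of the disturbances are left out of the count.

```python
def var_coefficient_count(m: int, p: int = 1) -> int:
    """Regression coefficients of a VAR(p) with constant terms for m series."""
    # Each of the m equations has m coefficients per lag plus one constant.
    return m * (m * p + 1)


print(var_coefficient_count(1000))  # 1001000
```

For 1000 households this gives one million autoregressive coefficients plus 1000 constants, far more than the 50 000 available observations.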

General  model  formulation 

We suppose that the data set consists of time series observed at n time
moments ($t = 1, \ldots, n$) for a number m of units ($i = 1, \ldots, m$). The variable
to be explained is denoted by $y_{it}$ and we assume that there are $(k - 1)$
explanatory variables $x_{it}$, where i denotes the unit and t the time moment.
In all that follows we exclude the constant term from $x_{it}$, as this term plays a
special role in the models that we will discuss in this section. The considered
models are of the following general form:

$$y_{it} = \alpha_{it} + x_{it}'\gamma_{it} + \varepsilon_{it}, \quad i = 1, \ldots, m, \quad t = 1, \ldots, n, \quad \operatorname{var}(\varepsilon) = \Omega.$$

Here $\Omega$ denotes the $mn \times mn$ covariance matrix of the $mn \times 1$ vector $\varepsilon$ of
disturbances $\varepsilon_{it}$. This model is far too general to be of practical use, as it
contains more parameters than observations. The number of observations is
$mn$ and the number of parameters is $kmn + \frac{1}{2}mn(mn + 1)$ (namely, $mn$ for
the constants $\alpha_{it}$, $(k - 1)mn$ for the slopes $\gamma_{it}$, and $\frac{1}{2}mn(mn + 1)$ for the
symmetric matrix $\Omega$). Therefore, to be practically useful we have to impose
restrictions on the regression parameters ($\alpha_{it}$, $\gamma_{it}$) and on the covariance
matrix $\Omega$.

Three  models  of  special  interest 

In  the  rest  of  this  section  we  pay  attention  to  three  specifications  that  are  of 
much  practical  interest.  In  Section  7.7.2  we  consider  the  seemingly  unrelated 
regression  model  that  is  characterized  by  the  restrictions  that 

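$$\alpha_{it} = \alpha_i, \quad \gamma_{it} = \gamma_i, \quad E[\varepsilon_{it}\varepsilon_{jt}] = \sigma_{ij}, \quad E[\varepsilon_{it}\varepsilon_{js}] = 0 \quad (\text{for all } i, j, \text{ and } t \neq s).$$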

So  the  parameters  differ  between  units,  but  they  are  constant  over  time  for 
each  given  unit.  Further,  the  error  terms  are  uncorrelated  over  time,  but  the 
error  terms  are  correlated  between  units  at  the  same  moment  of  time.  This 
model contains in total $mk + \frac{1}{2}m(m + 1)$ parameters. As we shall see in
Section 7.7.2, this model can be used only if $n \ge m$, that is, the length of
the time series should be at least as large as the number of units. In practice
the  number  of  units  may  be  very  large,  in  which  case  panel  data  models  are 
appropriate.  This  is  discussed  in  Section  7.7.3.  The  panel  data  model  (with 
fixed  effects)  corresponds  to  the  restrictions 

$$\alpha_{it} = \alpha_i, \quad \gamma_{it} = \gamma, \quad \Omega = \sigma^2 I.$$

So  the  slope  parameters  are  constant  across  units,  and  the  error  terms 
are  uncorrelated  (also  between  units  at  the  same  moment  of  time)  and 
homoskedastic.  In  this  way  the  number  of  parameters  is  drastically 
reduced,  to  m  +  k.  Finally,  in  Section  7.7.4  we  consider  simultaneous 
equation  models.  These  models  have  the  same  structure  as  the  seemingly 
unrelated regression model. The crucial difference is that some of the explanatory
variables are endogenous, as the dependent variable $y_{it}$ of one
unit plays the role of an explanatory variable in the equations for $y_{jt}$ of
other units.
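
To make the reduction in the number of parameters concrete, the following sketch evaluates the three counts given in this section. The function name is an assumption made for this illustration; the values m = 26, n = 37, and k = 3 are those of Example 7.29.

```python
def count_parameters(m: int, n: int, k: int) -> dict:
    """Parameter counts for m units, n time moments, and k regression parameters per equation."""
    mn = m * n
    return {
        "observations": mn,
        # mn constants, (k - 1)mn slopes, and the symmetric mn x mn covariance matrix
        "general": k * mn + mn * (mn + 1) // 2,
        # k regression parameters per unit plus the (co)variances sigma_ij
        "sur": m * k + m * (m + 1) // 2,
        # m constants, (k - 1) common slopes, and one variance
        "fixed_effects": m + k,
    }


counts = count_parameters(m=26, n=37, k=3)
print(counts)
```

With 962 observations, the general model would need 466 089 parameters, the SUR model 429, and the fixed effects panel model only 29.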

In  the  following  sections  we  briefly  describe  some  of  the  salient  features  of 
these  models.  For  further  details  we  refer  to  the  textbooks  mentioned  in  the 
Further  Reading  section  at  the  end  of  this  chapter. 
7.7.2 Seemingly unrelated regression model

Motivation  of  the  SUR  model 

Suppose  that  the  data  set  consists  of  time  series  for  a  number  of  units.  It  is 
assumed  that  the  marginal  effects  of  the  explanatory  variables  differ  per  unit, 
but  that  these  effects  are  constant  over  time.  In  terms  of  the  general  model  in 
Section 7.7.1, this means that $\alpha_{it} = \alpha_i$ and $\gamma_{it} = \gamma_i$. Further it is assumed that
the  disturbances  of  each  unit  are  not  serially  correlated.  However,  at  a  given 
time  the  disturbances  of  the  different  units  may  be  correlated.  This  reflects 
the  possibility  that  unobserved  influences  may  affect  all  units  simultaneously. 
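The assumptions on the error terms are $E[\varepsilon_{it}\varepsilon_{js}] = 0$ for all $t \neq s$ and for all
$i, j$, and $E[\varepsilon_{it}\varepsilon_{jt}] = \sigma_{ij}$ is possibly non-zero. One says that the error terms
contain contemporaneous correlation. If the explanatory variables $x_{it}$ are
exogenous, the resulting model

$$y_{it} = \alpha_i + x_{it}'\gamma_i + \varepsilon_{it},$$

$$E[\varepsilon_{it}\varepsilon_{jt}] = \sigma_{ij}, \quad E[\varepsilon_{it}\varepsilon_{js}] = 0 \quad (\text{for all } i, j, \text{ and } t \neq s),$$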

is  called  the  seemingly  unrelated  regression  (SUR)  model.  The  equations  for 
different  units  seem  to  be  unrelated,  as  all  parameters  are  different.  However, 
the observations $y_{it}$ and $y_{jt}$ are related if $\sigma_{ij} \neq 0$. So the relations between the
units  are  modelled  implicitly  via  the  correlation  of  the  error  terms.  For 
instance,  future  expectations  may  influence  the  behaviour  of  all  households 
in  the  data  set,  and  changes  in  the  state  of  the  world  economy  may  affect  the 
profit  of  all  firms  or  the  national  income  of  all  countries  in  the  data  set. 
Influential  factors  like  future  expectations  and  worldwide  prospects  are 
difficult  to  measure,  and  their  influence  on  all  units  is  incorporated  in  the 
unobserved error terms $\varepsilon_{it}$.


Model  formulation  in  matrix  form 

Let  the  observations  be  ordered  per  unit,  and  per  unit  let  the  observations  be 
ordered with time. We denote the data for unit i by the $n \times 1$ vector $y_i$ and by the
$n \times k$ matrix $X_i$, with corresponding $k \times 1$ parameter vector $\beta_i = (\alpha_i, \gamma_i')'$ and
$n \times 1$ vector of disturbances $\varepsilon_i$. The model for unit i can then be written as

$$y_i = X_i\beta_i + \varepsilon_i.$$

The parameters could be estimated by applying OLS per unit separately.
However, this is not efficient if the disturbances contain contemporaneous correlation.
By combining the models for the m units, the SUR model can be written
in matrix form as follows, where $\varepsilon = (\varepsilon_1', \ldots, \varepsilon_m')'$ is the $mn \times 1$ vector of disturbances.
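$$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{pmatrix} = \begin{pmatrix} X_1 & 0 & \cdots & 0 \\ 0 & X_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & X_m \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_m \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_m \end{pmatrix}, \quad \operatorname{var}(\varepsilon) = \Omega = \begin{pmatrix} \sigma_{11}I & \sigma_{12}I & \cdots & \sigma_{1m}I \\ \sigma_{21}I & \sigma_{22}I & \cdots & \sigma_{2m}I \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{m1}I & \sigma_{m2}I & \cdots & \sigma_{mm}I \end{pmatrix}. \qquad (7.41)$$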


Here $I$ denotes the $n \times n$ identity matrix. This is a regression model in standard
matrix form, but the error terms may be heteroskedastic and they may be correlated.
Because of the block-diagonal structure of the matrix of explanatory variables,
the OLS estimator of the $mk$ parameters $\beta_i$, $i = 1, \ldots, m$, in the above
model is equivalent to applying OLS per unit. However, this is not the best linear
unbiased estimator, because the $mn \times mn$ covariance matrix $\Omega$ of the model is not
of the form $\sigma^2 I$. In what follows we describe a general method to estimate
regression models for which the covariance matrix $\Omega$ does not have the standard
form (the SUR model is a special case).


The  method  of  generalized  least  squares 

An efficient estimator is obtained by transforming the model such that the covariance
matrix becomes of the form $\sigma^2 I$. This is called generalized least squares
(GLS). The idea is similar to the method of weighted least squares discussed in
Section 5.4.3 (p. 327-30) for a diagonal covariance matrix $\Omega$. As the covariance
has another structure here, we need to apply another type of transformation to
estimate the SUR model. The general idea is to transform the model $y = X\beta + \varepsilon$
(where $\varepsilon$ has mean 0 and covariance matrix $\Omega$) by means of an invertible matrix $A$
to $Ay = AX\beta + A\varepsilon$ in such a way that the transformed error vector $A\varepsilon$ (that has
mean 0) has covariance matrix $I$. As $A\varepsilon$ has covariance matrix $A\Omega A'$, the condition
on $A$ is that $A\Omega A' = I$, in which case $\Omega = A^{-1}(A')^{-1} = (A'A)^{-1}$. If we write
$y_* = Ay$, $X_* = AX$, and $\varepsilon_* = A\varepsilon$, then $y_* = X_*\beta + \varepsilon_*$ with $E[\varepsilon_*] = 0$ and
$\operatorname{var}(\varepsilon_*) = I$. Therefore, the best linear unbiased estimator of $\beta$ is given by

$$b_{GLS} = (X_*'X_*)^{-1}X_*'y_* = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y,$$

where we used the fact that $A'A = \Omega^{-1}$. This is called the generalized least squares
estimator.


Methods  to  compute  the  transformation  matrix 

There are different methods to compute the transformation matrix $A$. A
general method is the following. According to the matrix results in Appendix
A (Sections A.5 and A.6), the positive definite covariance matrix $\Omega$ can be
written as

$$\Omega = VDV',$$

where $D$ is a diagonal matrix with the positive eigenvalues $\lambda_i$ of $\Omega$ on the diagonal
and where $V$ is an orthogonal matrix with columns that are eigenvectors of $\Omega$ and
with the property that $V'V = I$. Now define $A = D^{-1/2}V'$, where $D^{-1/2}$ is the
diagonal matrix with elements $1/\sqrt{\lambda_i}$ on the diagonal. Then $A\Omega A' = I$, which
follows by writing out this matrix product and using that $V'V = I$ and
$D^{-1/2}DD^{-1/2} = I$. This general method is not always the most convenient way
to find a transformation matrix $A$, as the involved matrices $\Omega$ and $A$ are of
size $mn \times mn$. This is often too large to be practically feasible. Note that the
$mn \times mn$ matrix $\Omega$ in (7.41) is sparse, in the sense that it contains many zero
elements. Special methods have been developed to obtain transformations $A$ for
sparse matrices $\Omega$, but here we will not discuss these computational aspects
any further.
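
The eigenvalue method can be checked numerically on a small example. The sketch below uses NumPy with simulated data (the sizes, the seed, and all variable names are assumptions made for illustration). It builds $A = D^{-1/2}V'$, verifies that $A\Omega A' = I$, and confirms that OLS on the transformed data reproduces the direct GLS formula.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 12, 2  # small illustrative sizes

# A positive definite covariance matrix Omega, a regressor matrix X, and data y
B = rng.normal(size=(N, N))
Omega = B @ B.T + N * np.eye(N)
X = np.column_stack([np.ones(N), rng.normal(size=N)])
y = rng.normal(size=N)

# Omega = V D V' with orthogonal V; define A = D^{-1/2} V'
eigenvalues, V = np.linalg.eigh(Omega)
A = np.diag(eigenvalues ** -0.5) @ V.T

# OLS on the transformed data y* = Ay and X* = AX
y_star, X_star = A @ y, A @ X
b_transformed = np.linalg.solve(X_star.T @ X_star, X_star.T @ y_star)

# Direct GLS formula (X' Omega^{-1} X)^{-1} X' Omega^{-1} y
Omega_inv = np.linalg.inv(Omega)
b_gls = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y)

print(np.allclose(A @ Omega @ A.T, np.eye(N)))
print(np.allclose(b_transformed, b_gls))
```

For an $mn \times mn$ matrix this dense computation quickly becomes expensive, which is the practical objection raised above.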


Feasible  GLS 

The above GLS estimator $b_{GLS}$ can be computed only if the covariance
matrix $\Omega$ is known, but this is not the case in practice. The feasible GLS
estimator (FGLS) of the SUR model is computed in two steps. In the first step
the matrix $\Omega$ is estimated by least squares methods, and in the second step
this estimated covariance matrix is used in GLS. This approach is similar to
the method of feasible WLS for heteroskedastic error terms discussed in
Section 5.4.4 (p. 335-6).


Two-step  feasible  generalized  least  squares  (FGLS) 

• Step 1: Estimate the covariance matrix $\Omega$. Apply m regressions, one per unit,
to estimate $\beta_i$ by OLS, $i = 1, \ldots, m$. Let $e_i$ be the $n \times 1$ vector of OLS
residuals for unit i; then the (co)variances $\sigma_{ij}$ are estimated by
$s_{ij} = \frac{1}{n}\sum_{t=1}^{n} e_{it}e_{jt}$. The $mn \times mn$ matrix $\Omega$ is estimated by replacing $\sigma_{ij}$ in
(7.41) by $s_{ij}$.

• Step 2: Estimate the parameters $\beta_i$ jointly by GLS. The FGLS estimator is
obtained by substituting the estimated covariance matrix $\hat{\Omega}$ of step 1 into
the expression for the GLS estimator, so that

$$b_{FGLS} = (X'\hat{\Omega}^{-1}X)^{-1}X'\hat{\Omega}^{-1}y.$$


It can be shown that this estimator has the same (optimal) asymptotic
properties as ML in (7.41) if the error terms are normally distributed, the
number of time series observations $n \to \infty$, and the number m of units is
fixed. In particular, if n is large enough then

$$b_{FGLS} \approx N(\beta, (X_*'X_*)^{-1}) \approx N(\beta, (X'\hat{\Omega}^{-1}X)^{-1}).$$

This can be used to perform t- and F-tests, provided that the length n of the
time series is large enough. The first step in the above FGLS method is not
efficient, as it neglects the fact that the error terms are heteroskedastic and
correlated. One can iterate the FGLS steps, using the residuals of step 2 to
make a new estimate of $\Omega$ in step 1 and a corresponding new GLS estimate
in step 2. This can be iterated until the estimates converge. This is called
iterated FGLS.
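
The two FGLS steps can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the routine used for the exhibits in this section: the function `sur_fgls`, its arguments, and the simulated data are all hypothetical, and the dense $mn \times mn$ matrices are formed explicitly, which is feasible only for small problems. Setting `iterations` larger than 1 repeats the two steps a fixed number of times, as in iterated FGLS.

```python
import numpy as np

def sur_fgls(ys, Xs, iterations=1):
    """FGLS for the SUR model; iterations=1 gives the two-step estimator.

    ys: list of m arrays of shape (n,); Xs: list of m arrays of shape (n, k_i),
    each including a column of ones for the constant term.
    """
    m, n = len(ys), len(ys[0])
    y = np.concatenate(ys)                      # observations ordered per unit
    ks = [Xi.shape[1] for Xi in Xs]
    X = np.zeros((m * n, sum(ks)))              # block-diagonal regressor matrix
    col = 0
    for i, Xi in enumerate(Xs):
        X[i * n:(i + 1) * n, col:col + ks[i]] = Xi
        col += ks[i]
    b = np.linalg.lstsq(X, y, rcond=None)[0]    # step 1: OLS (equals OLS per unit)
    for _ in range(iterations):
        E = (y - X @ b).reshape(m, n).T         # n x m matrix of residuals
        S = E.T @ E / n                         # s_ij = (1/n) sum_t e_it e_jt
        # With per-unit ordering, block (i, j) of Omega is s_ij * I_n
        Omega_inv = np.kron(np.linalg.inv(S), np.eye(n))
        cov = np.linalg.inv(X.T @ Omega_inv @ X)
        b = cov @ X.T @ Omega_inv @ y           # step 2: GLS with estimated Omega
    return b, cov, S

# Simulated example: m = 3 units with contemporaneously correlated disturbances
rng = np.random.default_rng(1)
n, m = 200, 3
Sigma = np.array([[1.0, 0.5, 0.2], [0.5, 1.0, 0.4], [0.2, 0.4, 1.0]])
eps = rng.multivariate_normal(np.zeros(m), Sigma, size=n)
Xs = [np.column_stack([np.ones(n), rng.normal(size=n)]) for _ in range(m)]
betas = [np.array([1.0, 0.5]), np.array([-1.0, 2.0]), np.array([0.0, 1.0])]
ys = [Xs[i] @ betas[i] + eps[:, i] for i in range(m)]

b, cov, S = sur_fgls(ys, Xs)
print(b.round(2))                        # FGLS estimates, ordered per unit
print(np.sqrt(np.diag(cov)).round(3))    # asymptotic standard errors
```

The standard errors are taken from $(X'\hat{\Omega}^{-1}X)^{-1}$, in line with the asymptotic distribution stated above.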


SUR  requires  a  limited  number  of  units 

The SUR model can be estimated only if the number of units m is not larger than the
number of time series observations n per unit. This can be derived as follows. To
compute the SUR estimator $b_{FGLS}$ (in step 2), the estimated covariance matrix $\hat{\Omega}$ (of
step 1) should be invertible. After reordering the observations per time moment
instead of per unit, $\hat{\Omega}$ is an $mn \times mn$ block-diagonal matrix with the
$m \times m$ matrix $S$ on the diagonal, where $S$ has elements $s_{ij} = \frac{1}{n}e_i'e_j$ with $e_i$ the $n \times 1$
vector of OLS residuals of unit i. Therefore $\hat{\Omega}$ is invertible if and only if $S$ is
invertible, that is, if and only if rank($S$) = m. Now define the $n \times m$ matrix
$E = (e_1, \ldots, e_m)$; then $S = \frac{1}{n}E'E$ and $m = \operatorname{rank}(S) = \operatorname{rank}(E) \le n$. So $S$ and $\hat{\Omega}$ can
be invertible and $b_{FGLS}$ can be computed only if $m \le n$. If the number of units
present in the data set exceeds the length of the observation period per unit, then
SUR cannot be applied. Models that are appropriate for such data sets are discussed
in the next section.
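
The rank argument can be illustrated numerically. In the sketch below the residual matrix $E$ is filled with random numbers instead of actual OLS residuals (an assumption made only for illustration), and the function name is hypothetical. The rank of $S = \frac{1}{n}E'E$ can never exceed n, so $S$ is singular as soon as m is larger than n.

```python
import numpy as np

rng = np.random.default_rng(2)

def residual_covariance_rank(m: int, n: int) -> int:
    """Rank of S = E'E/n for a random n x m matrix E standing in for the residuals."""
    E = rng.normal(size=(n, m))
    S = E.T @ E / n                 # m x m matrix
    return int(np.linalg.matrix_rank(S))

print(residual_covariance_rank(m=3, n=10))   # S is 3 x 3
print(residual_covariance_rank(m=10, n=3))   # S is 10 x 10
```

In the second call the $10 \times 10$ matrix $S$ has rank at most 3, so it cannot be inverted and step 2 of FGLS breaks down.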


Cases  where  OLS  is  efficient 

There are two situations for which the GLS estimator in the SUR model boils
down to the OLS estimator per unit. This happens if the different units are
uncorrelated so that $\sigma_{ij} = 0$ for all $i \neq j$, or if all units have the same regressor
matrix in the sense that $X_i = X$ is the same for all $i = 1, \ldots, m$ (see Exercise 7.10).
In these cases OLS per unit is efficient.

The null hypothesis of uncorrelated units can be tested by means of the OLS
residual vectors $e_i$. Let the sample correlation of the residuals of units i and j be
defined by $r_{ij} = s_{ij}/\sqrt{s_{ii}s_{jj}}$, with $s_{ij}$ as defined in step 1 of the FGLS method. If
$\sigma_{ij} = 0$, then $r_{ij} \approx N(0, \frac{1}{n})$, and the LM-test for the absence of cross-unit correlations
($\sigma_{ij} = 0$ for all $i \neq j$) is given by

$$LM = n\sum_{i=1}^{m-1}\sum_{j=i+1}^{m} r_{ij}^2 \approx \chi^2\left(\tfrac{1}{2}m(m - 1)\right).$$

This is similar to the Box-Pierce test (5.50) for serial correlation. If the null
hypothesis  is  not  rejected,  then  one  can  apply  OLS  per  unit  without  loss  of 
efficiency;  otherwise  FGLS  is  preferred. 
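
The LM statistic is simple to compute from the residual series. The following sketch uses only the Python standard library; the function name and the input layout (one list of residuals per unit) are assumptions for illustration. It returns the statistic together with the number of residual correlations that enter it, $\frac{1}{2}m(m - 1)$.

```python
import math

def lm_cross_correlation(residuals):
    """LM statistic for contemporaneous correlation.

    residuals: list of m residual series, each of length n.
    Returns (LM, number of correlations m(m - 1)/2).
    """
    m, n = len(residuals), len(residuals[0])

    def s(i, j):
        # s_ij = (1/n) sum_t e_it e_jt
        return sum(a * b for a, b in zip(residuals[i], residuals[j])) / n

    total = 0.0
    for i in range(m - 1):
        for j in range(i + 1, m):
            r_ij = s(i, j) / math.sqrt(s(i, i) * s(j, j))
            total += r_ij ** 2
    return n * total, m * (m - 1) // 2


lm, df = lm_cross_correlation([[1.0, -1.0, 1.0, -1.0], [1.0, 1.0, -1.0, -1.0]])
print(lm, df)
```

The two toy series in the usage line are orthogonal, so their sample correlation and hence the statistic are zero.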


XM729PMI 


Example  7.29:  Primary  Metal  Industries 

To  illustrate  some  of  the  foregoing  methods  we  consider  yearly  production 
data  of  m  =  26  firms  in  the  US  primary  metal  industries  (SIC33)  over  the 
period  1958-94,  so  that  n  =  37.  The  data  are  taken  from  the  National 
Bureau  of  Economic  Research  (E.  J.  Bartelsman  and  W.  Gray,  The  NBER 
Manufacturing  Productivity  Database,  National  Technical  Working  Paper 
205, 1996).  We  will  discuss  (i)  the  data,  (ii)  the  estimates  of  an  SUR  model  for 
a  subset  of  three  firms,  (iii)  the  SUR  estimates  for  the  full  set  of  twenty-six 
firms,  and  (iv)  comments  on  the  outcomes. 

(i)  The  data 

The dependent variable $y_{it}$ is the output (value added) and the explanatory
variables are the input factors $L_{it}$ (labour, production worker wages) and $K_{it}$
(capital stock, both structures and equipment), all measured in millions of
1987 dollars. The Cobb-Douglas production function is $y_{it} = L_{it}^{\beta_i}K_{it}^{\gamma_i}e^{\alpha_i}e^{\varepsilon_{it}}$,
and by taking logarithms this can be written as

$$\log(y_{it}) = \alpha_i + \beta_i\log(L_{it}) + \gamma_i\log(K_{it}) + \varepsilon_{it}.$$

The  data  (all  in  logarithms)  are  shown  in  Exhibit  7.34.  Although  some  firms 
have  a  growing  or  declining  output  over  time,  this  is  by  no  means  a  general 
characteristic  of  the  data.  Here  we  will  not  incorporate  trends  in  the  model. 
Further,  although  the  firms  operate  in  the  same  industry  sector,  we  allow  for 
different labour elasticities ($\beta_i$) and capital elasticities ($\gamma_i$). The constants ($\alpha_i$)
represent  the  production  efficiency  of  the  firms  —  that  is,  the  output  for  given 
levels  of  labour  and  capital,  and  this  efficiency  is  also  allowed  to  vary 
between  firms. 

(ii)  SUR  estimates  for  a  subset  of  three  firms 

The full SUR model contains $mk = 26 \cdot 3 = 78$ parameters and will be
discussed  in  part  (iii)  below.  For  simplicity  we  first  consider  the  SUR  model 
for  a  subset  of  three  firms,  neglecting  the  data  of  the  other  twenty-three  firms. 
So  in  this  case  there  are  m  =  3  units  with  n  =  37  time  series  observations  per 
unit.  Panels  1  and  6  of  Exhibit  7.35  show  the  results  of  OLS  and  of  SUR 
(two-step  FGLS)  for  the  three  firms.  The  OLS  estimates  are  close  to  the 
SUR  estimates,  but  most  of  the  standard  errors  are  smaller  in  the  SUR 
model,  so  that  the  parameters  are  somewhat  more  efficiently  estimated  by 
Exhibit 7.34 Primary Metal Industries (Example 7.29)

Yearly production data over the period 1958-94 of twenty-six firms in the US primary metal
sector, value added (a), labour input (b), and capital stock (c), all in logarithms.




Panel 1: Dependent Variable: LOGPROD
Method: Pooled Least Squares (OLS)
Sample: 1958 1994; Included observations: 37
Number of cross-sections: 3; Total panel (balanced) 111 obs.

Variable    Coefficient   Std. Error   t-Statistic   Prob.
C_1         -1.087911     0.504532     -2.156275     0.0334
C_2          1.363460     0.500062      2.726584     0.0075
C_3         -0.339983     0.383163     -0.887306     0.3770
LOGLAB_1     0.961796     0.034245     28.08564      0.0000
LOGLAB_2     0.922977     0.169924      5.431716     0.0000
LOGLAB_3     1.330223     0.300807      4.422174     0.0000
LOGCAP_1     0.272703     0.074273      3.671644     0.0004
LOGCAP_2     0.005695     0.080226      0.070987     0.9435
LOGCAP_3     0.180628     0.074909      2.411288     0.0177

R-squared             0.950496
Log likelihood        62.90914
S.E. of regression    0.143215

[Panels (b)-(d): time plots of the OLS residuals of firms _1, _2, and _3]


Panel 5: Residual correlation matrix OLS

            OLSRES_1   OLSRES_2   OLSRES_3
OLSRES_1    1.000000   0.053386   0.221128
OLSRES_2    0.053386   1.000000   0.376310
OLSRES_3    0.221128   0.376310   1.000000

Panel 6: Dependent Variable: LOGPROD
Method: Seemingly Unrelated Regression (2-step FGLS)
Sample: 1958 1994; Included observations: 37
Number of cross-sections: 3; Total panel (balanced) 111 obs.

Variable    Coefficient   Std. Error   t-Statistic   Prob.
C_1         -1.048012     0.336424     -3.115150     0.0024
C_2          1.478908     0.634286      2.331612     0.0217
C_3         -0.394136     0.275345     -1.431427     0.1554
LOGLAB_1     0.960662     0.022609     42.49061      0.0000
LOGLAB_2     0.954174     0.214414      4.450154     0.0000
LOGLAB_3     1.355261     0.201169      6.736918     0.0000
LOGCAP_1     0.266874     0.049581      5.382560     0.0000
LOGCAP_2    -0.012646     0.101660     -0.124391     0.9013
LOGCAP_3     0.184460     0.052020      3.545939     0.0006

R-squared             0.950448
Log likelihood        77.99390
S.E. of regression    0.143284

Exhibit  7.35  Primary  Metal  Industries  (Example  7.29) 


Cobb-Douglas  production  functions  of  three  firms  estimated  by  OLS  (Panel  1)  with  time  plots 
of  the  three  corresponding  residual  series  ((b)-(d))  and  their  correlation  matrix  (Panel  5),  and 
model  estimated  by  SUR  (two-step  FGLS,  Panel  6);  the  firms  (1,  2,  and  3)  are  indicated  by  an 
underscore  (_1,  _2,  and  _3). 
Panel 7: Residual correlation matrix SUR model

            SURRES_1   SURRES_2   SURRES_3
SURRES_1    1.000000   0.051879   0.226467
SURRES_2    0.051879   1.000000   0.381129
SURRES_3    0.226467   0.381129   1.000000

[Panels (h)-(m): histograms of the twenty-six SUR estimates and of their t-values; summary statistics shown with the histograms:]

Series          Observations   Mean        Median      Maximum    Minimum     Std. Dev.
ALPHA           26             -0.098826   -0.786431   9.905750   -6.079299   2.893946
BETA            26              0.899339    0.900832   1.195192    0.650669   0.134732
T-value GAMMA   26              4.553846    4.697501   15.43294   -7.380802   5.493491


Exhibit 7.35 (Contd.)

Contemporaneous residual correlation matrix corresponding to the SUR model of Panel 6
(Panel 7) and SUR estimates (two-step FGLS) of Cobb-Douglas production functions of
twenty-six firms, histograms of resulting twenty-six estimates of the constant term $\alpha$ (h), the
labour elasticity $\beta$ (j), and the capital elasticity $\gamma$ (l), together with histograms of the twenty-six
corresponding t-values ((i), (k), and (m)).


FGLS. Exhibit 7.35 (b-d) show time plots of the three series of OLS residuals.
They follow different patterns and the cross correlations between
the three series are relatively small (0.05, 0.22, and 0.38, see Panel 5). The
LM-test for the significance of contemporaneous cross correlations is
$LM = n(r_{12}^2 + r_{13}^2 + r_{23}^2) = 37((0.05)^2 + (0.22)^2 + (0.38)^2) = 7.15 \approx \chi^2(3)$
(computed from the unrounded correlations in Panel 5),
with corresponding P-value P = 0.07. So the null hypothesis that $\sigma_{ij} = 0$ for
$i \neq j$ is not rejected at the 5 per cent level. This explains why the SUR and OLS
estimates are close together.
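
The value of the LM statistic and its P-value can be reproduced from the residual correlations in Panel 5 of Exhibit 7.35. The sketch below uses only the Python standard library; the helper `chi2_sf_3df` is a hypothetical name, and it relies on the closed-form expression for the upper tail probability of the $\chi^2$ distribution with three degrees of freedom.

```python
import math

n = 37
r12, r13, r23 = 0.053386, 0.221128, 0.376310   # Panel 5 of Exhibit 7.35

lm = n * (r12**2 + r13**2 + r23**2)

def chi2_sf_3df(x: float) -> float:
    """Upper tail probability P(chi2(3) > x), valid for 3 degrees of freedom only."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

p_value = chi2_sf_3df(lm)
print(round(lm, 2), round(p_value, 2))  # 7.15 0.07
```

With the correlations rounded to two decimals the statistic would be about 7.2, so the reported 7.15 corresponds to the unrounded values.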
(iii) SUR estimates for the full set of twenty-six firms

The results of SUR (two-step FGLS) on the full set of $m = 26$ firms are
summarized in Exhibit 7.35 (h-m) by means of histograms of the resulting
twenty-six estimates of $\alpha_i$, $\beta_i$, $\gamma_i$, and their t-values. The estimated capital
elasticities ($\gamma_i$) vary considerably across firms. Several of these coefficients are
not significant and some are even negative (the minimum elasticity is $-0.78$,
the maximum 1.14). The estimated labour elasticities ($\beta_i$) vary less (minimum
0.65, maximum 1.20) and are all significant.

(iv)  Comments  on  the  outcomes 

The SUR model in part (iii) contains a large number of parameters. There are
$mk = 78$ regression parameters. The $mn \times mn = 962 \times 962$ covariance
matrix $\Omega$ is block-diagonal, with symmetric $26 \times 26$ matrices (with elements
$\sigma_{ij}$) on the diagonal. This gives $\frac{1}{2}m(m + 1) = 351$ additional parameters. In
total, the SUR model uses $mn = 962$ observations to estimate (78 + 351) = 429
parameters. So the number of parameters is large as compared to the amount
of data information, and this causes a lack of significance in some of the
obtained estimates. More significant results can be obtained by imposing
parameter restrictions on the model, as will be discussed in the next section
(see Example 7.30).

Exercises:  T:  7.10a-c;  E:  7.25a,  b. 


7.7.3  Panel  data 

Panel  model  with  fixed  effects 

In some data sets (for instance, in consumer panels) the number of units
(m) is much larger than the number of observations (n) per unit. In such
situations one speaks of panel data or longitudinal data. For such data sets
the SUR model cannot be applied, as this requires that $m \le n$. The SUR
model should then be simplified to reduce the number of parameters. One
way to get a manageable model is to assume that the marginal effects of the
explanatory variables on the dependent variable are the same for all units.
This corresponds to the restriction that the slopes $\gamma_i = \gamma$ are constant across
units. To account for differences between the units, the constant terms $\alpha_i$ are
allowed to vary among units. These constant terms stand for all unobserved
aspects that distinguish the units from each other. For instance, in a consumer
panel this may capture differences in unobserved wealth of the households,
and in a panel of firms it may represent differences in management style. A
further simplification is obtained by the additional assumption that the
random variables $\varepsilon_{it}$ are homoskedastic and uncorrelated, both over time and
across units. Under the above assumptions, the model becomes

$$y_{it} = \alpha_i + x_{it}'\gamma + \varepsilon_{it}, \quad \varepsilon_{it} \sim \mathrm{IID}(0, \sigma^2).$$

Here $x_{it}$ is a $(k - 1) \times 1$ vector of explanatory variables that does not include
a constant term. This is called the panel model with fixed effects, as the
difference between the units is modelled in terms of the unit-specific, fixed
(but unknown) parameters $\alpha_i$. The model has $(m + k)$ parameters (m for
$\alpha_i$, $(k - 1)$ for $\gamma$, and 1 for $\sigma^2$). As compared with the SUR model, which
has $mk + \frac{1}{2}m(m + 1)$ parameters, this is a considerable simplification even
for moderate values of m.


Fixed  effects  model  in  matrix  form 

The model can be written as a standard multiple regression model with constant
coefficients by incorporating unit dummies, defined by $D_{it}(j) = 1$ if $i = j$ and
$D_{it}(j) = 0$ if $i \neq j$. Then the model becomes

$$y_{it} = \sum_{j=1}^{m}\alpha_j D_{it}(j) + x_{it}'\gamma + \varepsilon_{it}, \quad \varepsilon_{it} \sim \mathrm{IID}(0, \sigma^2).$$

This can also be written in matrix form, as follows. Let $y_i$ be the $n \times 1$ vector with
the values $y_{it}$, $t = 1, \ldots, n$, let $\varepsilon_i$ be defined in a similar way in terms of $\varepsilon_{it}$, and let
$X_i$ be the $n \times (k - 1)$ matrix with t-th row $x_{it}'$, $t = 1, \ldots, n$. Further let $\iota$ be the
$n \times 1$ vector with all elements equal to 1. Then the equation for the i-th unit is
$y_i = \iota\alpha_i + X_i\gamma + \varepsilon_i$. Now stack these equations for $i = 1, \ldots, m$, and write $y$ for the
$mn \times 1$ vector consisting of the stacked $y_i$, $\varepsilon$ for the $mn \times 1$ vector of stacked $\varepsilon_i$,
and $X$ for the $mn \times (k - 1)$ matrix of the stacked $X_i$. Further, let $D$ be the
following $mn \times m$ matrix built from the $n \times 1$ vectors $\iota$ and 0.

$$D = \begin{pmatrix} \iota & 0 & \cdots & 0 \\ 0 & \iota & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \iota \end{pmatrix}$$

Let $\alpha = (\alpha_1, \ldots, \alpha_m)'$; then the model can be written in matrix form as

$$y = X\gamma + D\alpha + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2 I).$$


Fixed  effects  regression  by  numerically  efficient  methods 

In the above model, the regressors $x_{it}$ are assumed to be exogenous. As
$\operatorname{var}(\varepsilon) = \sigma^2 I$, efficient estimators of the parameters ($\gamma$ and $\alpha$) of this model are
obtained by OLS. Direct application of OLS involves the inverse of the
$(m + k - 1) \times (m + k - 1)$ matrix $(X\ D)'(X\ D)$. If the number m of units is
large, then the computation of the inverse of such a large matrix is numerically
cumbersome. We now describe a computationally simpler method that is based on
the partial regression result of Frisch-Waugh in Section 3.2.5 (p. 146). This result
states that the slope parameters $\gamma$ can be estimated by means of the following two
steps. In step 1, regress $y$ and $X$ on $D$ and determine the ‘cleaned’ variables, that
is, the residuals of these two regressions. These residuals are given by $M_D y$ and
$M_D X$, where $M_D = I - D(D'D)^{-1}D'$. This is easy to compute, since $D'D = nI$, so
that $(D'D)^{-1} = \frac{1}{n}I$. The ‘cleaned’ variable $M_D y$ has elements $y_{it} - \bar{y}_i$, where
$\bar{y}_i = \frac{1}{n}\sum_{t=1}^{n} y_{it}$ is the average over the i-th unit. In a similar way, $M_D X$ has elements
$x_{it} - \bar{x}_i$. In step 2 of partial regression, the OLS estimates of $\gamma$ are obtained by
regressing $M_D y$ on $M_D X$. This gives


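$$\hat{\gamma}_{OLS} = (X'M_D X)^{-1}X'M_D y = \left(\sum_{i=1}^{m}\sum_{t=1}^{n}(x_{it} - \bar{x}_i)(x_{it} - \bar{x}_i)'\right)^{-1}\left(\sum_{i=1}^{m}\sum_{t=1}^{n}(x_{it} - \bar{x}_i)(y_{it} - \bar{y}_i)\right).$$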

This regression involves only $(k - 1)$ parameters, and we need to compute only
the inverse of the matrix $X'M_D X$, whose size $(k - 1) \times (k - 1)$ does not
depend on the number $m$ of units. This greatly simplifies the computations.
The OLS estimates of the constants $\alpha_i$ can be obtained as follows. One of the
normal equations in the matrix model reads $D'X\hat{\gamma} + D'D\hat{\alpha} = D'y$, so that
$\hat{\alpha} = (D'D)^{-1}(D'y - D'X\hat{\gamma})$. By writing this out we obtain


$$\hat{\alpha}_i = \bar{y}_i - \bar{x}_i'\hat{\gamma}.$$
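The two-step computation above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the panel is balanced and stored as arrays of shape (m, n) and (m, n, k-1), the data are simulated, and the function names `fixed_effects_within` and `fixed_effects_dummies` are hypothetical. The second function applies OLS directly to (X D) so that the two routes can be compared.

```python
import numpy as np

def fixed_effects_within(y, X):
    """Within (de-meaned) OLS for a balanced panel.

    y has shape (m, n); X has shape (m, n, k-1) and excludes the constant.
    Returns (gamma_hat, alpha_hat).
    """
    y_bar = y.mean(axis=1)                      # unit means of y, shape (m,)
    X_bar = X.mean(axis=1)                      # unit means of x, shape (m, k-1)
    # Step 1: 'cleaned' variables with elements y_it - ybar_i and x_it - xbar_i.
    y_dm = (y - y_bar[:, None]).reshape(-1)
    X_dm = (X - X_bar[:, None, :]).reshape(-1, X.shape[2])
    # Step 2: regress cleaned y on cleaned X; only a (k-1) x (k-1) system.
    gamma_hat = np.linalg.solve(X_dm.T @ X_dm, X_dm.T @ y_dm)
    # Constants: alpha_i = ybar_i - xbar_i' gamma_hat.
    alpha_hat = y_bar - X_bar @ gamma_hat
    return gamma_hat, alpha_hat

def fixed_effects_dummies(y, X):
    """Direct OLS on (X D) with m unit dummies, for comparison only."""
    m, n, k1 = X.shape
    D = np.kron(np.eye(m), np.ones((n, 1)))     # mn x m matrix of unit dummies
    Z = np.hstack([X.reshape(-1, k1), D])
    coef = np.linalg.lstsq(Z, y.reshape(-1), rcond=None)[0]
    return coef[:k1], coef[k1:]

# Simulated illustration (all numbers below are assumptions, not book data).
rng = np.random.default_rng(0)
m, n = 5, 8
X = rng.normal(size=(m, n, 2))
alpha = rng.normal(size=m)
gamma = np.array([0.8, 0.2])
y = alpha[:, None] + X @ gamma + 0.1 * rng.normal(size=(m, n))

g_within, a_within = fixed_effects_within(y, X)
g_dummy, a_dummy = fixed_effects_dummies(y, X)
```

Both routes should give the same estimates up to rounding; the within route never forms a matrix whose size grows with m.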


Properties  of  fixed  effects  estimators 

Under the above assumptions, the OLS estimators have the usual properties
discussed in Chapter 3. This means that t- and F-tests can be applied in the usual
way. Here the variance $\sigma^2$ should be estimated as
$s^2 = \frac{1}{mn - (m + k - 1)} \sum_{i=1}^{m} \sum_{t=1}^{n} e_{it}^2$, where
$e_{it}$ are the OLS residuals, as the regression model has $mn$ observations and
$(m + k - 1)$ regression parameters. For instance, the null hypothesis of absence
of unit-specific effects ($\alpha_1 = \cdots = \alpha_m$) can be tested by the
F-test with $(m - 1)$ and $(mn - (m + k - 1))$ degrees of freedom.
If the above two-step estimation method is used, then the standard errors and
t-values obtained in the second step should be corrected, as the second-step
regression has only $(k - 1)$ parameters instead of $(m + k - 1)$ (see Exercise 3.9).
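As a hedged sketch of this correction, assume that the second-step regression is run without a constant on the mn de-meaned observations, so that standard software divides the residual sum of squares by mn - (k - 1) instead of mn - (m + k - 1). Under that assumption the reported standard errors are rescaled as follows; the numerical illustration uses the dimensions of Example 7.30 (m = 26, n = 37, k - 1 = 2).

```latex
\mathrm{se}_{\text{corrected}}
  = \mathrm{se}_{\text{step 2}} \times
    \sqrt{\frac{mn - (k - 1)}{mn - (m + k - 1)}},
\qquad
\text{for } m = 26,\ n = 37,\ k - 1 = 2:\quad
\sqrt{\frac{960}{934}} \approx 1.014 .
```

If the software includes a constant in the second step, the numerator changes accordingly, so the factor should be checked against the degrees of freedom that the software actually reports.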

The OLS estimator of $\gamma$ is consistent if $mn \to \infty$. It suffices that
$m \to \infty$ with $n$ fixed, so that the marginal effects $\gamma$ can also be
estimated accurately if the number of observations per unit is small. This is
because these effects are assumed to be the same for all units. The OLS estimator
of $\alpha$ is consistent only for $n \to \infty$, as increasing the number of
units does not help to estimate the constant terms of previous units.

Panel model with random effects

In the panel model with fixed effects, all unit-specific characteristics that are
constant over time are absorbed in the constant terms $\alpha_i$. For instance, in a
consumer panel that extends over a limited number of weeks we cannot
discriminate between variables like the sex and the living area of the
consumer. If we were to add these variables as additional regressors, we would
obtain perfect collinearity, so these individual-specific effects cannot be
estimated in the fixed effects model. In such situations it is helpful to adjust
the model for the constant terms $\alpha_i$, for instance, by assuming that they
consist of independent drawings from an underlying population. This is a
realistic assumption in many cases, because the units (say households) in the
sample are often randomly drawn from a larger population of units. Suppose
that $\alpha_i \sim \mathrm{IID}(\alpha, \sigma_\eta^2)$ and that these effects are
independent of the disturbances $\varepsilon_{it}$. Then we can write
$\alpha_i = \alpha + \eta_i$ with $\eta_i \sim \mathrm{IID}(0, \sigma_\eta^2)$, and

$$y_{it} = \alpha + x_{it}'\gamma + \omega_{it}, \qquad \omega_{it} = \varepsilon_{it} + \eta_i.$$

This is called the panel model with random effects. As before, the $(k - 1) \times 1$
vector of regressors $x_{it}$ excludes the constant term. In this model the
regressors $x_{it}$ may contain variables that are constant over the observed time
interval but vary between units, such as the sex and living area of consumers
or the location and management style of firms. The above model has $k$
regression parameters, as compared to $(k + m - 1)$ in the panel model with
fixed effects. If the number of units is large, this leads to a considerable
reduction in the number of parameters. However, compared to the fixed
effects model, the disturbances $\omega_{it}$ are more complex, as (within units) they
are correlated over time. Under the above assumptions there holds

$$E[\omega_{it}] = 0, \quad E[\omega_{it}^2] = \sigma^2 + \sigma_\eta^2, \quad
E[\omega_{it}\omega_{is}] = \sigma_\eta^2 \ (\text{for } t \neq s),$$

$$E[\omega_{it}\omega_{js}] = 0 \ (\text{for all } t, s, \text{ and } i \neq j).$$
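The four moment conditions above imply that the mn x mn covariance matrix of the disturbances is block diagonal, with one n x n block per unit. A minimal sketch, assuming the observations are stacked unit by unit (the function name `random_effects_cov` and the numerical values are illustrative assumptions):

```python
import numpy as np

def random_effects_cov(m, n, sigma2, sigma2_eta):
    """Covariance matrix of the stacked random effects disturbances.

    Each unit contributes the n x n block sigma2 * I + sigma2_eta * 11',
    and disturbances of different units are uncorrelated.
    """
    block = sigma2 * np.eye(n) + sigma2_eta * np.ones((n, n))
    return np.kron(np.eye(m), block)

# Small illustration with assumed values sigma2 = 1 and sigma2_eta = 0.5.
omega = random_effects_cov(m=3, n=4, sigma2=1.0, sigma2_eta=0.5)
```

Only m * n * n of the (mn)^2 entries are non-zero, and all of them are determined by the two variance parameters.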


Random  effects  FGLS  estimation  by  numerically  efficient  methods 

The parameters ($\alpha$ and $\gamma$) in the random effects model can be estimated
consistently by OLS. However, the OLS estimators are not efficient and the usual OLS
formulas for the standard errors are not valid, because of the cross correlations
between the disturbances for the same unit at different moments of time. Efficient
estimates can be obtained by two-step FGLS. The general FGLS method was
discussed in the previous section. Note, however, that the $mn \times mn$ covariance
matrix $\Omega$ of the disturbances $\omega_{it}$ in the random effects model contains many zero
elements and that $\Omega$ depends on only two parameters, $\sigma^2$ and $\sigma_\eta^2$. This simple
structure of $\Omega$ can be exploited to compute the estimates in a numerically efficient
way. We will illustrate this by considering the first step of the FGLS method in
more detail. That is, we describe a simple method to estimate the two variance
parameters $\sigma^2$ and $\sigma_\eta^2$. Once these two parameters are estimated, the second step of
FGLS can also be performed in a straightforward way, but we do not present the
computational details of this second step.

The unit-specific disturbance $\eta_i$ is fixed for unit $i$, so it can be removed by taking
the observations in the $i$th unit in deviation from the sample mean in this unit. This
leads to the following relation between the de-meaned variables:

$$y_{it} - \bar{y}_i = (x_{it} - \bar{x}_i)'\gamma + (\varepsilon_{it} - \bar{\varepsilon}_i).$$

We can estimate this model by OLS, using in total $mn$ de-meaned observations.
For each unit, one degree of freedom is lost as the $n$ de-meaned observations
of the $i$th unit add up to zero. Let $\hat{\gamma}$ be the OLS estimator of $\gamma$ obtained
from the $mn$ de-meaned data; then the variance $\sigma^2$ of the disturbances $\varepsilon_{it}$ can be
estimated by


$$\hat{\sigma}^2 = \frac{1}{m(n - 1)} \sum_{i=1}^{m} \sum_{t=1}^{n}
\left( y_{it} - \bar{y}_i - (x_{it} - \bar{x}_i)'\hat{\gamma} \right)^2.$$


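Next, to estimate the variance $\sigma_\eta^2$ between units, we consider the averages per unit
so that


$$\bar{y}_i = \alpha + \bar{x}_i'\gamma + \bar{\varepsilon}_i + \eta_i, \qquad i = 1, \ldots, m.$$

The error terms of this equation have (between-units) variance
$\sigma_B^2 = \mathrm{var}(\bar{\varepsilon}_i + \eta_i) = \frac{1}{n}\sigma^2 + \sigma_\eta^2$.
Let $\hat{\sigma}_B^2$ be the estimated variance in the above equation, computed from the
squared residuals of the $m$ unit averages. Then the variance
$\sigma_\eta^2 = \sigma_B^2 - \frac{1}{n}\sigma^2$ can be estimated by

$$\hat{\sigma}_\eta^2 = \hat{\sigma}_B^2 - \frac{1}{n}\hat{\sigma}^2.$$

The above estimates of $\sigma^2$ and $\sigma_\eta^2$ can be substituted in the covariance matrix $\Omega$
and used in the second step of FGLS to get efficient estimates of the parameters $\alpha$
and $\gamma$.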
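The first FGLS step can be sketched as follows. This is an illustration under assumptions rather than the book's own computation: the panel is balanced, the data are simulated, the function name `variance_components` is hypothetical, and the between-units variance is normalised by m (other degrees-of-freedom corrections are possible).

```python
import numpy as np

def variance_components(y, X):
    """Estimate (sigma2, sigma2_eta) in the random effects model.

    y has shape (m, n); X has shape (m, n, k-1) and excludes the constant.
    """
    m, n, k1 = X.shape
    y_bar, X_bar = y.mean(axis=1), X.mean(axis=1)

    # Within regression on de-meaned data gives sigma2_hat.
    y_dm = (y - y_bar[:, None]).reshape(-1)
    X_dm = (X - X_bar[:, None, :]).reshape(-1, k1)
    gamma_hat = np.linalg.lstsq(X_dm, y_dm, rcond=None)[0]
    resid_w = y_dm - X_dm @ gamma_hat
    sigma2_hat = resid_w @ resid_w / (m * (n - 1))

    # Regression of the unit averages on a constant and the averaged regressors.
    Z = np.column_stack([np.ones(m), X_bar])
    coef_b = np.linalg.lstsq(Z, y_bar, rcond=None)[0]
    resid_b = y_bar - Z @ coef_b
    sigma2_B_hat = resid_b @ resid_b / m        # assumed normalisation by m

    # sigma2_eta = sigma2_B - sigma2 / n.
    return sigma2_hat, sigma2_B_hat - sigma2_hat / n

# Simulated illustration with true sigma2 = 1 and sigma2_eta = 0.5 (assumed).
rng = np.random.default_rng(42)
m, n = 2000, 5
X = rng.normal(size=(m, n, 2))
eta = np.sqrt(0.5) * rng.normal(size=m)
eps = rng.normal(size=(m, n))
y = 1.0 + X @ np.array([0.8, 0.2]) + eta[:, None] + eps

s2, s2_eta = variance_components(y, X)
```

In small samples the difference that defines the second estimate can turn out negative, in which case it is usually set to zero or a different estimator is used.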

Comments  on  panel  models 

Panel  data  sets  are  becoming  increasingly  popular  in  many  fields  of  business 
and  economics.  In  this  way,  both  common  and  individual  characteristics  of 
individuals (in marketing), of firms (in finance), and of countries (in
international economics) can be measured and incorporated in one model.

The  above  panel  models  can  be  extended  in  several  ways.  For  instance, 
time-specific  effects  that  are  the  same  for  all  units  can  be  modelled  by 
including  n  time  dummies  in  the  regression  model.  All  the  usual  diagnostic 
tests  discussed  before  are  of  equal  importance  in  panel  data  models.  Tests 
for exogeneity, heteroskedasticity, correlation, lagged effects, unit roots, and
so  on,  can  be  performed  in  panel  data  models.  The  required  methods 
and  computations  can  often  be  simplified  by  exploiting  the  special  structure 
of  the  panel  models.  Of  special  interest  are  possible  selection  effects, 
discussed  in  Section  6.3.3  (p.  500-4).  In  the  random  effects  model  it 
is  assumed  that  the  units  that  are  present  in  the  observed  panel  are 
randomly  selected,  but  this  is  often  not  the  case.  For  instance,  firms  that 
go  bankrupt  or  individuals  that  stop  buying  a  certain  product  fall  out  of 
the  panel,  and  this  selection  may  very  well  be  endogenous.  The  models 
discussed  in  Section  6.3.3  can  be  extended  to  deal  with  selection  effects 
in  panel  data. 

Example  7.30:  Primary  Metal  Industries  (continued) 

We continue our analysis of the production data of m = 26 firms over n = 37
years,  introduced  in  Example  7.29  in  the  foregoing  section.  We  will  discuss 
(i)  the  panel  model  with  fixed  effects,  (ii)  the  panel  model  with  random 
effects,  (iii)  the  results  of  OLS  on  a  restricted  model,  and  (iv)  comments  on 
the  obtained  results. 

(i)  Panel  model  with  fixed  effects 

We assume that the labour elasticity ($\beta$) and the capital elasticity ($\gamma$) are
constant across firms. Then the panel data model with fixed firm effects
becomes


$$\log(y_{it}) = \alpha_i + \beta \log(L_{it}) + \gamma \log(K_{it}) + \varepsilon_{it}.$$

Here $\alpha_i$ can be interpreted as a measure of production efficiency of firm $i$,
because the larger it is the more the firm produces for given input levels of
labour and capital. The estimated parameters of this panel model are shown
in Panel 1 of Exhibit 7.36. The estimated labour elasticity is 0.84 (with
standard error 0.021) and the estimated capital elasticity is 0.17 (0.020).
The constants $\alpha_i$ differ significantly across firms. This is visualized by means
of a histogram in Exhibit 7.36 (c), and the F-test for equal constants in Panel
2 has P-value P = 0.000.

(ii)  Panel  model  with  random  effects 

Next  we  estimate  the  panel  model  with  random  effects.  The  corresponding 
FGLS  estimates  are  reported  in  Panel  4  of  Exhibit  7.36.  The  estimated 
elasticity  of  labour  is  0.82  (with  standard  error  0.019)  and  that  of  capital  is 
0.16  (0.018).  These  results  are  very  close  to  the  ones  obtained  in  part  (i)  in 
the  fixed  effects  model. 
(a)

Panel 1: Dependent Variable: LOGPROD
Method: Panel data model (FIXED EFFECTS) OLS
Sample: 1958 1994; Included observations: 37
Number of cross-sections: 26; Total panel (balanced) observations: 962

Variable   Coefficient   Std. Error   t-Statistic   Prob.

LOGLAB      0.839131     0.021278     39.43618      0.0000
LOGCAP      0.174129     0.020315      8.571613     0.0000
C_1        -0.242098     0.234578     -1.032058     0.3023
C_2         0.013008     0.149059      0.087267     0.9305
C_3         0.103623     0.149092      0.695030     0.4872
C_4         0.080292     0.155827      0.515261     0.6065
C_5         0.099636     0.152249      0.654429     0.5130
C_6        -0.283235     0.186157     -1.521489     0.1285
C_7        -0.331813     0.140583     -2.360265     0.0185
C_8        -0.008232     0.133647     -0.061594     0.9509
C_9        -0.187380     0.161702     -1.158795     0.2468
C_10        0.294013     0.153477      1.915677     0.0557
C_11        0.237625     0.115523      2.056952     0.0400
C_12       -0.086367     0.135793     -0.636019     0.5249
C_13        0.150745     0.177403      0.849735     0.3957
C_14        0.254307     0.141531      1.796823     0.0727
C_15        0.272252     0.150234      1.812182     0.0703
C_16        0.040844     0.161199      0.253374     0.8000
C_17       -0.138201     0.173634     -0.795935     0.4263
C_18       -0.132822     0.160190     -0.829154     0.4072
C_19        0.059361     0.127754      0.464654     0.6423
C_20        0.229970     0.149753      1.535659     0.1250
C_21        0.221035     0.170087      1.299545     0.1941
C_22       -0.163537     0.159114     -1.027793     0.3043
C_23       -0.045883     0.133522     -0.343636     0.7312
C_24       -0.004967     0.135952     -0.036532     0.9709
C_25        0.111176     0.143650      0.773934     0.4392
C_26        0.176474     0.135686      1.300602     0.1937

R-squared            0.954206
S.E. of regression   0.247219

(b) 


Panel  2:  Wald  Test:  Equality  of  constants  (25  restrictions) 

F-statistic  15.98583  Probability  0.000000 


(c)

[Histogram of the series CONST; horizontal axis from -0.3 to 0.3]

Series: CONST; Observations: 26

Mean          0.027686
Median        0.050102
Maximum       0.294013
Minimum      -0.331813
Std. Dev.     0.181042
Skewness     -0.323889
Kurtosis      2.073341
Jarque-Bera   1.384841
Probability   0.500363

Exhibit  7.36  Primary  Metal  Industries  (Example  7.30) 


Estimated production function of twenty-six firms in panel data model with fixed effects (Panel
1, the constants $\alpha_i$ are denoted by C_i), F-test on equality of the twenty-six firm-specific
constants (Panel 2), and histogram of these twenty-six constants (c).

(d)


Panel 4: Dependent Variable: LOGPROD
Method: Panel data model (RANDOM EFFECTS) FGLS
Sample: 1958 1994; Included observations: 37
Number of cross-sections: 26; Total panel (balanced) observations: 962

Variable   Coefficient   Std. Error   t-Statistic   Prob.

C           0.133174     0.128244      1.038439     0.2993
LOGLAB      0.818331     0.018680     43.80877      0.0000
LOGCAP      0.162876     0.018043      9.027196     0.0000

Panel  5:  Dependent  Variable:  LOGPROD 

Method:  Pooled  Least  Squares  (all  coefficients  constant  across  firms) 
Sample:  1958  1994;  Included  observations:  37 

Number  of  cross-sections:  26;  Total  panel  (balanced)  observations:  962 
Variable  Coefficient  Std.  Error  t-Statistic  Prob. 

C  0.036817  0.094349  0.390220  0.6965 

LOGLAB  0.764332  0.015065  50.73630  0.0000 

LOGCAP  0.184513  0.014828  12.44331  0.0000 


Panel 6: Dependent Variable: LOGPROD
OLS; Sample: 1958 1958; Number of cross-sections: 26; Total obs: 26

Variable   Coefficient   Std. Error   t-Statistic   Prob.

C           0.214441     0.341991      0.627037     0.5368
LOGLAB      0.804157     0.070529     11.40184      0.0000
LOGCAP      0.136557     0.058223      2.345402     0.0280

Panel 7: Dependent Variable: LOGPROD
OLS; Sample: 1994 1994; Number of cross-sections: 26; Total obs: 26

Variable   Coefficient   Std. Error   t-Statistic   Prob.

C          -0.159270     0.825574     -0.192920     0.8487
LOGLAB      0.730405     0.097399      7.499110     0.0000
LOGCAP      0.241949     0.120829      2.002401     0.0572

Exhibit 7.36 (Contd.)

Estimated  production  function  of  twenty-six  firms  in  panel  data  model  with  random 
effects  (FGLS,  Panel  4),  OLS  estimates  in  panel  model  with  all  coefficients  (including  the 
constant  term)  fixed  across  firms  for  the  full  data  period  (Panel  5),  for  the  first  year  (1958, 
Panel  6),  and  for  the  last  year  in  the  sample  (1994,  Panel  7). 


(iii)  Results  of  OLS  on  a  restricted  model 

Next we impose the restriction that all firms are equally efficient, so that $\alpha_i$ is
constant across firms. The corresponding OLS estimates (with mn = 962
observations) are given in Panel 5 of Exhibit 7.36. The elasticity of labour is
estimated as 0.76 (with standard error 0.015), that of capital as 0.18
(standard error 0.015). Panels 6 and 7 of Exhibit 7.36 show the results of cross
section  regressions  for  the  first  year  1958  and  for  the  last  year  1994  in  the 
sample  (both  with  m  =  26  observations).  The  results  do  not  show  large 
variations  in  the  estimated  elasticities  for  different  years.  In  all  cases,  the 
labour  elasticity  is  much  larger  than  the  capital  elasticity. 

(iv) Comments on the obtained results

The above results remain basically the same if trends are incorporated in the
model. Note that simpler models (that is, models with fewer parameters)
give more significant estimates (higher t-values) of the labour and capital
elasticities. For the OLS model with all parameters fixed across firms the
t-values are (50.7, 12.4), for the random effects model (43.8, 9.0), and for the
fixed effects model (39.4, 8.6), whereas for the SUR model in Example 7.29
the averages of the t-values over the twenty-six firms are (21.9, 4.6). Smaller
models often have larger t-values, because the use of fewer parameters provides
a gain in degrees of freedom. However, the required parameter restrictions
are not well supported by the data. For instance, in Example 7.29 the
elasticities vary considerably across firms, so that the panel data model is not
supported. And the constants in the fixed effects panel model differ
significantly, so that the regression model with fixed parameters ($\alpha$, $\beta$, $\gamma$) involves
unrealistic restrictions. For these data, the SUR model is most appropriate
to model the differences between firms. Both panel models (with fixed or
random effects) give a useful description of the average structure of
production in primary metal industries and of possible differences in efficiency
between firms.

Exercises:  E:  7.25c-f. 


7.7.4  Simultaneous  equation  model 

Model  formulation 

Historically,  the  first  type  of  multiple  equation  models  that  was  developed 
within  econometrics  is  the  simultaneous  equation  model  (SEM).  This  model 
has the same structure as the SUR model $y_{it} = \alpha_i + x_{it}'\gamma_i + \varepsilon_{it}$, but the crucial
difference is that in the equation for $y_{it}$ some of the regressors $x_{it}$ consist of
endogenous variables $y_{jt}$, $j \neq i$. We split the set of all variables that appear in
the $m$ equations into two groups, the group of $m$ endogenous variables $y_{jt}$
and a group of $k$ exogenous variables $z_{jt}$ (including the constant term). Then
the equations can be written as

$$y_{it} = \sum_{j \neq i} \gamma_{ij} y_{jt} + \sum_{j=1}^{k} \beta_{ij} z_{jt} + \varepsilon_{it},
\qquad i = 1, \ldots, m, \quad t = 1, \ldots, n.$$

This model contains an equation for each of the $m$ endogenous variables. If
all the coefficients $\gamma_{ij}$ are zero, then the model reduces to the SUR model. The
model is called simultaneous if $\gamma_{ij} \neq 0$ for some $j \neq i$, because in this case $y_{it}$
depends on the endogenous variable $y_{jt}$ that itself is explained by another
equation in the model. In particular, if for some $j \neq i$ both $\gamma_{ij} \neq 0$ and $\gamma_{ji} \neq 0$,
then $y_{it}$ and $y_{jt}$ depend on each other simultaneously.

Inconsistency  of  OLS 

Because  the  equations  of  the  SEM  contain  endogenous  regressors,  OLS  is  not 
consistent.  This  was  analysed  for  single  equation  regression  models  in 
Section 5.7. To illustrate this for the SEM, we consider the model with
$m = 2$ equations where $\gamma_{12} \neq 0$ and $\gamma_{21} \neq 0$. The model equations are

$$y_{1t} = \gamma_{12} y_{2t} + \sum_{j=1}^{k} \beta_{1j} z_{jt} + \varepsilon_{1t},$$

$$y_{2t} = \gamma_{21} y_{1t} + \sum_{j=1}^{k} \beta_{2j} z_{jt} + \varepsilon_{2t}.$$

The regressor $y_{2t}$ in the first equation is endogenous (that is, it is correlated
with the error term $\varepsilon_{1t}$), as it depends on $y_{1t}$ and hence on $\varepsilon_{1t}$ because of the
second equation. OLS is inconsistent because $y_{1t}$ and $y_{2t}$ depend on each
other simultaneously.

Historically,  SEM  are  mostly  used  for  yearly  macroeconomic  data  where 
variables  like  national  income,  consumption,  imports,  and  exports  depend 
mutually  on  each  other.  Before  we  discuss  the  general  simultaneous  equation 
model  further,  we  first  consider  a  simple  simulation  example  to  illustrate  the 
main  ideas. 

Example  7.31:  Simulated  Macroeconomic  Consumption  and 
Income 

We  simulate  data  from  a  simple  macroeconomic  model  that  consists  of  two 
equations,  a  consumption  equation  and  an  income  equation.  We  will  discuss 
(i)  the  model  and  the  parameter  of  interest,  (ii)  an  analysis  of  the  OLS  bias, 
(iii)  the  simulated  data,  (iv)  the  results  of  OLS  and  IV,  and  (v)  a  graphical 
interpretation. 

(i)  The  model  and  the  parameter  of  interest 

The variables are consumption (denoted by $C_t$), disposable income (denoted
by $D_t$), and government expenditures (denoted by $G_t$), and the model is

consumption equation: $C_t = \alpha + \beta D_t + \varepsilon_{1t}$,
income equation: $D_t = C_t + G_t + \varepsilon_{2t}$.

Here the government expenditures are assumed to be exogenous (that is,
independent of $\varepsilon_{1t}$ and $\varepsilon_{2t}$), and the (endogenous) dependent variables are
consumption and income. The error terms $\varepsilon_{1t}$ and $\varepsilon_{2t}$ are mutually
independent. The parameter of interest in this simple Keynesian model is the
multiplier, that is, the average effect of government expenditures on income. To
determine this effect, we substitute the first equation in the second one and
solve for $D_t$, with the result that


$$D_t = \frac{\alpha}{1 - \beta} + \frac{1}{1 - \beta} G_t + \frac{\varepsilon_{1t} + \varepsilon_{2t}}{1 - \beta}. \qquad (7.42)$$

So the multiplier is equal to $1/(1 - \beta)$.
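For completeness, the substitution that leads to (7.42) can be written out step by step. The intermediate lines below are not in the text; they follow directly from the two model equations, under the assumption that $\beta \neq 1$.

```latex
\begin{aligned}
D_t &= C_t + G_t + \varepsilon_{2t}
     = (\alpha + \beta D_t + \varepsilon_{1t}) + G_t + \varepsilon_{2t}, \\
(1 - \beta) D_t &= \alpha + G_t + \varepsilon_{1t} + \varepsilon_{2t}, \\
D_t &= \frac{\alpha}{1 - \beta} + \frac{1}{1 - \beta}\, G_t
      + \frac{\varepsilon_{1t} + \varepsilon_{2t}}{1 - \beta},
\qquad
\frac{\partial D_t}{\partial G_t} = \frac{1}{1 - \beta}.
\end{aligned}
```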


(ii)  Analysis  of  the  OLS  bias 

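The multiplier can be obtained by estimating the marginal effect $\beta$ of income
on consumption. If we were to neglect the income equation and estimate $\beta$
simply by applying OLS to the consumption equation, then the resulting
estimator would not be consistent. The reason is that the regressor $D_t$ in the
consumption equation is not exogenous. This follows from (7.42), which shows
that $D_t$ is correlated with the error term $\varepsilon_{1t}$. The (asymptotic) bias of OLS,
that is, $\mathrm{plim}(b) - \beta$, follows from


$$\mathrm{plim}(b)
= \mathrm{plim} \left( \frac{\frac{1}{n}\sum (D_t - \bar{D})(C_t - \bar{C})}{\frac{1}{n}\sum (D_t - \bar{D})^2} \right)
= \beta + \mathrm{plim} \left( \frac{\frac{1}{n}\sum (D_t - \bar{D})(\varepsilon_{1t} - \bar{\varepsilon}_1)}{\frac{1}{n}\sum (D_t - \bar{D})^2} \right)
= \beta + \frac{\mathrm{cov}(D_t, \varepsilon_{1t})}{\mathrm{var}(D_t)}.$$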

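Let $\sigma_1^2$ be the variance of $\varepsilon_{1t}$, $\sigma_2^2$ that of $\varepsilon_{2t}$, and $\sigma_G^2$ that of $G_t$. Since
$\mathrm{cov}(G_t, \varepsilon_{1t}) = 0$ by assumption, as $G_t$ is exogenous, and since $\varepsilon_{1t}$ and $\varepsilon_{2t}$
are assumed to be independent, it follows from (7.42) that
$\mathrm{cov}(D_t, \varepsilon_{1t}) = \sigma_1^2/(1 - \beta)$ and
$\mathrm{var}(D_t) = (\sigma_G^2 + \sigma_1^2 + \sigma_2^2)/(1 - \beta)^2$. We conclude that


$$\mathrm{plim}(b) = \beta + (1 - \beta) \frac{\sigma_1^2}{\sigma_G^2 + \sigma_1^2 + \sigma_2^2}.$$


So the inconsistency of OLS is relatively small if $\sigma_G^2$ is large compared to
$\sigma_1^2$, that is, if the systematic variation (due to the variable $G_t$) is large
compared to the random variation $\varepsilon_{1t}$ in the consumption function.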
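The size of the inconsistency is easy to evaluate numerically. The short sketch below codes the probability limit of the OLS slope as a function of the slope parameter and the three variances; the function name `ols_plim` and the chosen values (slope 0.5, all variances equal to 1) are illustrative assumptions.

```python
def ols_plim(beta, var_g, var_e1, var_e2):
    """Probability limit of the OLS slope in the consumption equation."""
    return beta + (1.0 - beta) * var_e1 / (var_g + var_e1 + var_e2)

# Illustration: beta = 0.5 and unit variances, so plim(b) = 0.5 + 0.5/3.
b_lim = ols_plim(beta=0.5, var_g=1.0, var_e1=1.0, var_e2=1.0)
multiplier_lim = 1.0 / (1.0 - b_lim)      # implied multiplier 1/(1 - b)

# A larger variance of G shrinks the inconsistency.
b_lim_large_g = ols_plim(beta=0.5, var_g=100.0, var_e1=1.0, var_e2=1.0)
```

With these values the OLS limit is about 0.67 and the implied multiplier about 3, compared with a true multiplier of 2.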


(iii)  Simulated  data 

We simulate n = 100 observations from this model, as follows. As parameter
values we take $\alpha = 0$ and $\beta = 0.5$, so that the multiplier $1/(1 - \beta) = 2$. The error
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(a) 


Panel  1:  Dependent  Variable:  CONS 
Method:  Least  Squares 

Included observations: 100

Variable  Coefficient  Std.  Error  t-Statistic  Prob. 

C  -3.041052  0.417545  -7.283168  0.0000 

INC  0.656743  0.021417  30.66457  0.0000 


(b)


Panel  2:  Dependent  Variable:  CONS 
Method:  Instrumental  Variables 
Included  observations:  100 
Instrument  list:  C  GOV 

Variable  Coefficient  Std.  Error  t-Statistic  Prob. 

C  -0.136355  1.018765  -0.133843  0.8938 

INC  0.505148  0.052938  9.542177  0.0000 


[Figures (c)-(f): scatter diagrams and fitted lines, described in the caption below]


Exhibit  7.37  Simulated  Macroeconomic  Consumption  and  Income  (Example  7.31) 

Results of simulated consumption and income data in a simultaneous equation model,
OLS estimate of the consumption function (Panel 1) and IV estimate (Panel 2); ((c)-(d)) show
scatter diagrams of the consumption and income data and the lines fitted by OLS (c) and IV (d);
((e)-(f)) show the two DGP relations (e) and these relations together with the estimated OLS
line (f).
terms $\varepsilon_{1t}$ and $\varepsilon_{2t}$ are independently drawn from the standard normal
distribution, and $G_t$ is independent from $\varepsilon_{1t}$ and $\varepsilon_{2t}$ and drawn from the normal
distribution with mean 10 and variance 1. Then $D_t$ is obtained from (7.42),
and finally $C_t$ is obtained from the consumption equation. As $\beta = 0.5$ and
$\sigma_1^2 = \sigma_2^2 = \sigma_G^2 = 1$, it follows from (ii) above that $\mathrm{plim}(b) = 0.5 + (0.5/3) = 0.67$.
That is, if we estimate $\beta$ by OLS, we can expect to find a multiplier
$1/(1 - b)$ of around 3, whereas the actual multiplier is only 2. So OLS greatly
overestimates the effects of government spending on income.

(iv)  Results  of  OLS  and  IV 

Panel 1 of Exhibit 7.37 shows the results of OLS for the simulated data. The
estimated value of $\beta$ is $b = 0.657$ with corresponding multiplier
$1/(1 - b) = 2.9$. As was discussed in Section 5.7.2 (p. 404-5), consistent
estimators can be obtained by the method of instrumental variables. As the
consumption equation contains two parameters ($\alpha$ and $\beta$), we need two
instruments. Since the government expenditures are assumed to be exogenous,
we can take $G_t$ and the constant term as instruments. The corresponding
two-stage least squares estimates are shown in Panel 2 of Exhibit 7.37. The
IV estimate of $\beta$ is $b_{IV} = 0.505$ with multiplier $1/(1 - b_{IV}) = 2.0$, so that
these estimates are much closer to the true values than the OLS estimates.
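The experiment can be replicated along the following lines. This is a sketch, not the simulation behind Exhibit 7.37: the random seed, the much larger sample size (chosen so that the estimates settle near their probability limits), and the function names are assumptions. In this just-identified case the IV slope with instruments (1, G) reduces to a ratio of sample covariances.

```python
import numpy as np

def simulate(n, alpha=0.0, beta=0.5, seed=0):
    """Draw (C, D, G) from the two-equation consumption and income model."""
    rng = np.random.default_rng(seed)
    e1 = rng.standard_normal(n)
    e2 = rng.standard_normal(n)
    g = 10.0 + rng.standard_normal(n)             # mean 10, variance 1
    d = (alpha + g + e1 + e2) / (1.0 - beta)      # reduced form for income
    c = alpha + beta * d + e1                     # consumption equation
    return c, d, g

def ols_slope(c, d):
    """OLS slope of C on D (with a constant)."""
    dc = d - d.mean()
    return dc @ (c - c.mean()) / (dc @ dc)

def iv_slope(c, d, g):
    """IV slope of C on D with the constant and G as instruments."""
    gc = g - g.mean()
    return gc @ (c - c.mean()) / (gc @ (d - d.mean()))

c, d, g = simulate(n=100_000)
b_ols = ols_slope(c, d)
b_iv = iv_slope(c, d, g)
```

With a sample this large, the OLS slope should lie close to 0.67 and the IV slope close to 0.5; with only 100 observations the estimates vary more from one draw to the next.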

(v)  Graphical  interpretation 

Exhibit 7.37 (c) and (d) show the scatter diagrams of the simulated
consumption and income data, together with the fitted OLS and IV lines. Clearly, the
IV line fits the scatter less nicely than the OLS line. However, the location of
the points $(C_t, D_t)$ in the scatter is determined not only by the consumption
equation, but also by the income equation. Exhibit 7.37 (e) shows the two
equations that result for $\varepsilon_{1t} = 0$, $\varepsilon_{2t} = 0$ and $G_t = 10$, that is, by
substituting the mean values of these random variables in the two model equations. In
our simulation example, these relations become $C_t = \alpha + \beta D_t = 0.5 D_t$ for
the consumption equation, and $D_t = C_t + 10$ or equivalently $C_t = D_t - 10$
for the income equation. The data $(C_t, D_t)$ satisfy both these equations
(apart from random variations in $\varepsilon_{1t}$, $\varepsilon_{2t}$ and $G_t$). The first equation has
slope 0.5, but the second equation has slope 1. As OLS tries to find the line
closest to the scatter, the resulting OLS line has a slope somewhere between
0.5 and 1. This is illustrated in Exhibit 7.37 (f).

Estimation  by  2SLS  and  the  order  condition 

Now we return to the general simultaneous equation model. In Section 5.7.1
(p. 398-400) we described the method of instrumental variables to get
consistent estimators of the parameters of an equation with endogenous
regressors. In an SEM this method can be applied per equation with the
exogenous variables $z_{jt}$ as instruments. This is called the two-stage least
squares (2SLS) method. For each equation $i = 1, \ldots, m$, the parameters of
that equation are estimated by the following two steps.


Estimation  of  simultaneous  equation  model  by  2SLS 

•  Step 1: Regress the endogenous regressors on the exogenous ones. For each
of the regressors $y_{jt}$ that appear in the model equation (that is, for which
$\gamma_{ij} \neq 0$), perform a regression of $y_{jt}$ on the set of all $k$ exogenous variables
$\{z_{jt}, j = 1, \ldots, k\}$. Let $\hat{y}_{jt}$ be the fitted values of these regressions.

•  Step 2: Regress $y_{it}$ on $\hat{y}_{jt}$ and $z_{jt}$. In the equation for $y_{it}$, replace the regressors
$y_{jt}$ by the fitted values of step 1 and estimate the parameters by OLS in the
equation $y_{it} = \sum_{j \neq i} \gamma_{ij} \hat{y}_{jt} + \sum_{j=1}^{k} \beta_{ij} z_{jt} + \varepsilon_{it}$, $t = 1, \ldots, n$.
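The two steps above can be sketched with NumPy least squares. The function name `two_stage_least_squares`, its argument names, and the simulated two-equation system (with coefficients chosen purely for illustration) are assumptions, not part of the text.

```python
import numpy as np

def two_stage_least_squares(y, Y_endog, Z_incl, Z_all):
    """2SLS for one equation of a simultaneous equation model.

    y        (n,)     dependent variable of the equation
    Y_endog  (n, m_i) endogenous regressors in the equation
    Z_incl   (n, k_i) exogenous regressors that appear in the equation
    Z_all    (n, k)   all exogenous variables of the model (instruments)
    """
    # Step 1: regress the endogenous regressors on all exogenous variables.
    Y_hat = Z_all @ np.linalg.lstsq(Z_all, Y_endog, rcond=None)[0]
    # Step 2: OLS of y on the fitted values and the included exogenous ones.
    W_hat = np.column_stack([Y_hat, Z_incl])
    return np.linalg.lstsq(W_hat, y, rcond=None)[0]

# Simulated system (assumed values):
#   y1 = 0.5 y2 + 1.0 z1 + e1,   y2 = -0.4 y1 + 1.0 z2 + e2.
rng = np.random.default_rng(1)
n = 50_000
Z = rng.standard_normal((n, 2))
E = rng.standard_normal((n, 2))
Gamma = np.array([[1.0, -0.5], [0.4, 1.0]])   # endogenous variables moved left
B = np.eye(2)
Y = (Z @ B.T + E) @ np.linalg.inv(Gamma).T    # solve the system for (y1, y2)

# First equation: endogenous regressor y2, included exogenous regressor z1.
coef_2sls = two_stage_least_squares(Y[:, 0], Y[:, [1]], Z[:, [0]], Z)
coef_ols = np.linalg.lstsq(np.column_stack([Y[:, 1], Z[:, 0]]), Y[:, 0],
                           rcond=None)[0]
```

Here `coef_2sls` should be close to (0.5, 1.0), whereas the OLS coefficient of y2 is pulled away from 0.5 because y2 is correlated with the error term of the first equation. Note that standard errors taken from the second-step regression are not valid as they stand, because its residuals are computed with the fitted rather than the observed regressors.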


As was discussed in Section 5.7.1, this method requires that the number of
instrumental variables that do not appear in the equation is at least as large as
the number of endogenous regressors in the equation, the so-called order
condition. Let $m_i$ be the number of endogenous regressors $y_{jt}$ ($j \neq i$) and let $k_i$
be the number of exogenous regressors $z_{jt}$ that appear on the right-hand side
of the equation for $y_{it}$. Then the order condition is that

$$k - k_i \geq m_i.$$

Equivalently, the condition is that $(k - k_i) + (m - m_i - 1) \geq (m - 1)$, where
$(k - k_i)$ is the number of exogenous variables and $(m - m_i - 1)$ is the number
of endogenous variables that do not appear in the $i$th equation. In words, the
order condition means that the total number of excluded variables from the
$i$th equation should be at least $(m - 1)$. That is, for every equation at least
$(m - 1)$ of the $(m - 1 + k)$ parameters appearing on the right-hand side of
that equation should be set equal to zero. Such restrictions are called
exclusion restrictions or identification restrictions. The restrictions should be
based on economic theory.
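The order condition is a simple counting rule, which the following sketch makes explicit. The function name `order_condition` is hypothetical; the first call uses the consumption equation of Example 7.31, which has k = 2 exogenous variables in the model (the constant and government expenditures), of which one (the constant) appears in the equation, and one endogenous regressor (income).

```python
def order_condition(k, k_i, m_i):
    """Check whether an equation satisfies the order condition.

    k    number of exogenous variables in the whole model
    k_i  number of exogenous regressors in the equation
    m_i  number of endogenous regressors in the equation
    """
    return (k - k_i) >= m_i

# Consumption equation of Example 7.31: one excluded instrument, one
# endogenous regressor, so the condition holds.
consumption_ok = order_condition(k=2, k_i=1, m_i=1)

# If government expenditures also appeared in the equation, no excluded
# instrument would be left and the condition would fail.
no_exclusion_ok = order_condition(k=2, k_i=2, m_i=1)
```

The order condition is only a counting rule; it does not check whether the excluded exogenous variables are actually related to the endogenous regressors.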


Estimation  by  3SLS 

The 2SLS method is a single equation method that neglects the possible
contemporaneous covariances $\sigma_{ij} = E[\varepsilon_{it}\varepsilon_{jt}]$ between the error terms $\varepsilon_{it}$ and $\varepsilon_{jt}$. Therefore
the 2SLS estimators are consistent but not (asymptotically) efficient. In the SEM,
the error terms $\varepsilon_{it}$ are assumed to be uncorrelated over time but possibly
contemporaneously correlated across the $m$ equations. The SEM then has the same
structure as the SUR model (7.41) of Section 7.7.2, with the difference that
some of the regressors are endogenous. The cross equation correlations can be
treated in a similar way to that discussed in Section 7.7.2 for the SUR model, by
applying two-step FGLS to the system of equations. This leads to the following
system method for the joint estimation of the parameters of all $m$ model equations.

Estimation of simultaneous equation model by 3SLS

•  Steps  1  and  2:  Apply  2SLS.  Apply  2SLS  to  each  of  the  m  equations  separately. 
Estimate  the  mn  x  mn  covariance  matrix  of  the  system,  as  described  in  step 
1  of  the  two-step  FGLS  estimator  for  SUR  models  in  Section  7.7.2. 

•  Step  3:  Apply  GLS  to  the  system  of  equations.  Apply  step  2  of  the  two-step 
FGLS  estimator  for  SUR  models  in  Section  7.7.2. 
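A compact sketch of these three stages is given below. It follows the standard textbook form of 3SLS, in which the regressors of every equation are replaced by their projections on all exogenous variables and the stacked system is weighted by the inverse of the estimated m x m error covariance matrix. The function name, the data layout, and the simulated system are assumptions, and the SUR steps of Section 7.7.2 are not reproduced here.

```python
import numpy as np

def three_stage_least_squares(ys, Ws, Z):
    """3SLS for a system of m equations.

    ys  list of (n,) dependent variables, one per equation
    Ws  list of (n, p_i) regressor matrices (endogenous and included exogenous)
    Z   (n, k) matrix of all exogenous variables
    Returns (stacked 3SLS coefficients, list of 2SLS coefficients).
    """
    n, m = Z.shape[0], len(ys)
    # Projections of all regressors on the exogenous variables.
    W_hats = [Z @ np.linalg.lstsq(Z, W, rcond=None)[0] for W in Ws]
    # Stages 1 and 2: 2SLS per equation; residuals use the observed regressors.
    deltas = [np.linalg.lstsq(Wh, y, rcond=None)[0]
              for Wh, y in zip(W_hats, ys)]
    resid = np.column_stack([y - W @ d for y, W, d in zip(ys, Ws, deltas)])
    S_inv = np.linalg.inv(resid.T @ resid / n)    # inverse of m x m covariance
    # Stage 3: GLS on the stacked system, assembled block by block so that
    # no mn x mn matrix is ever formed.
    A = np.block([[S_inv[i, j] * W_hats[i].T @ W_hats[j] for j in range(m)]
                  for i in range(m)])
    b = np.concatenate([sum(S_inv[i, j] * W_hats[i].T @ ys[j]
                            for j in range(m)) for i in range(m)])
    return np.linalg.solve(A, b), deltas

# Simulated system (assumed values), with error correlation 0.6:
#   y1 = 0.5 y2 + 1.0 z1 + e1,   y2 = -0.4 y1 + 1.0 z2 + e2,
# and a third exogenous variable z3 excluded from both equations.
rng = np.random.default_rng(2)
n = 20_000
Z = rng.standard_normal((n, 3))
chol = np.linalg.cholesky(np.array([[1.0, 0.6], [0.6, 1.0]]))
E = rng.standard_normal((n, 2)) @ chol.T
Gamma = np.array([[1.0, -0.5], [0.4, 1.0]])
B = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
Y = (Z @ B.T + E) @ np.linalg.inv(Gamma).T

ys = [Y[:, 0], Y[:, 1]]
Ws = [np.column_stack([Y[:, 1], Z[:, 0]]),
      np.column_stack([Y[:, 0], Z[:, 1]])]
delta_3sls, deltas_2sls = three_stage_least_squares(ys, Ws, Z)
```

In this simulation `delta_3sls` should be close to (0.5, 1.0, -0.4, 1.0), the stacked coefficients of the two equations.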


This is called the three-stage least squares (3SLS) method. This method is asymptotically
efficient, as it is equivalent to ML (if the error terms are jointly normally
distributed, as discussed in Section 7.7.2). Note that the 3SLS method uses the
$\tfrac{1}{2}m(m+1)$ estimated cross covariances $s_{ij}$ that are based on the 2SLS residual
series of length n. If the sample size n is small, these covariances may not be
estimated reliably, and in that case one often uses 2SLS in practice. The 2SLS
method also has the advantage that the estimator for the ith equation remains
consistent if another equation $j \neq i$ is not correctly specified. On the other hand,
3SLS uses all m equations to estimate the ith equation, and, if one equation is
wrongly specified, then in general all parameters are estimated inconsistently.
Summarizing, 3SLS is a good method if one has enough confidence in the
specification of all the model equations and if sufficiently long time series of the
variables are available. Otherwise 2SLS is preferred.
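The three steps can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the procedure of any particular software package: the two-equation model, the parameter values, the sample size, and the helper names `tsls` and `three_sls` are all hypothetical, and the data are simulated.

```python
import numpy as np

def tsls(y, X, Z):
    """2SLS for one equation: regressors X, instruments Z."""
    # First stage: project the regressors on the instrument space.
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
    # Second stage: regress y on the projected regressors.
    beta = np.linalg.lstsq(X_hat, y, rcond=None)[0]
    return beta, X_hat

def three_sls(ys, Xs, Z):
    """3SLS for m equations sharing the same instrument matrix Z."""
    m, n = len(ys), len(ys[0])
    # Steps 1 and 2: 2SLS per equation and the residual covariance matrix S.
    fits = [tsls(y, X, Z) for y, X in zip(ys, Xs)]
    resid = np.column_stack([y - X @ b for y, X, (b, _) in zip(ys, Xs, fits)])
    S = resid.T @ resid / n
    # Step 3: GLS on the stacked system, error covariance S (Kronecker) I_n.
    S_inv = np.linalg.inv(S)
    X_hats = [x_hat for _, x_hat in fits]
    offs = np.concatenate([[0], np.cumsum([X.shape[1] for X in Xs])])
    A = np.zeros((offs[-1], offs[-1]))
    c = np.zeros(offs[-1])
    for i in range(m):
        for j in range(m):
            A[offs[i]:offs[i + 1], offs[j]:offs[j + 1]] = S_inv[i, j] * X_hats[i].T @ X_hats[j]
            c[offs[i]:offs[i + 1]] += S_inv[i, j] * X_hats[i].T @ ys[j]
    beta = np.linalg.solve(A, c)
    return np.split(beta, offs[1:-1]), [b for b, _ in fits], S

# Simulated SEM (assumed values):
#   y1 = 0.5*y2 + 1.0*z1 + e1,   y2 = -0.4*y1 + 1.0*z2 - 1.0*z3 + e2
rng = np.random.default_rng(0)
n = 5000
Zx = rng.normal(size=(n, 3))
E = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 1.0]], size=n)
Gamma = np.array([[1.0, -0.5], [0.4, 1.0]])   # unit diagonal, minus gamma off the diagonal
B = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, -1.0]])
Y = (Zx @ B.T + E) @ np.linalg.inv(Gamma).T   # reduced form: Y_t = Gamma^{-1}(B z_t + e_t)
y1, y2 = Y[:, 0], Y[:, 1]
X1 = np.column_stack([y2, Zx[:, 0]])
X2 = np.column_stack([y1, Zx[:, 1], Zx[:, 2]])

b_3sls, b_2sls, S = three_sls([y1, y2], [X1, X2], Zx)
print("2SLS:", b_2sls)
print("3SLS:", b_3sls)
```

With a correctly specified system and a long sample, as here, both estimators should lie close to the assumed parameter values; the sketch does not illustrate the misspecification risk discussed above.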


Dynamic  simultaneous  models 

In some cases the SEM equations contain not only endogenous variables
$y_{jt}$ ($j \neq i$) and exogenous variables $z_{jt}$ as regressors, but also lagged values of
$y_{jt}$ ($j = 1, \ldots, m$). This is called a dynamic simultaneous equation model.

The lagged endogenous variables $y_{j,t-k}$ (with $k \geq 1$) are (contemporaneously)
uncorrelated with the disturbance terms $\varepsilon_{it}$, so that they provide
proper instruments. The 2SLS estimates can be computed as before, taking
as instruments all exogenous and all lagged endogenous variables. A dynamic
SEM can also be written as follows, where we bring all current and lagged
endogenous variables to the left-hand side of the equations:

$$\Phi(L) Y_t = B z_t + \varepsilon_t, \qquad \Phi(L) = \Gamma - \sum_{j=1}^{p} \Phi_j L^j,$$

where the m × m matrix $\Gamma$ has elements 1 on the diagonal and elements $-\gamma_{ij}$
for $i \neq j$ (the sign changes because the current endogenous variables are
shifted from the right-hand side to the left-hand side of the equations).
This is a VAR model with m endogenous variables $Y_t = (y_{1t}, \ldots, y_{mt})'$ and
with k exogenous variables $z_t = (z_{1t}, \ldots, z_{kt})'$. The only distinction is that
in the standard VAR model the polynomial matrix $\Phi(L)$ is of the form
$\Phi(L) = I - \sum_{j=1}^{p} \Phi_j L^j$, whereas in the SEM the identity matrix is replaced by
the unknown invertible m × m matrix $\Gamma$, which contains the parameters
$\gamma_{ij}$, $j \neq i$. A standard VAR model is obtained if the matrix equations are
premultiplied by the inverse of the matrix $\Gamma$. After this transformation, the
resulting VAR model can be estimated as described in Sections 7.6.2 and
7.6.3. However, if one is interested in the parameters $\gamma_{ij}$, then 2SLS
provides more information, since the matrix $\Gamma$ is lost in the transformed VAR
model.
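The premultiplication step can be checked numerically. The sketch below is only an illustration: it assumes a two-variable structural model with one lag and no exogenous variables, and all matrix values are made up. It shows that the structural form and the reduced-form VAR generate the same series, and that a different structural pair can lead to the same VAR coefficients, which is why the contemporaneous matrix cannot be recovered from the VAR alone.

```python
import numpy as np

# Assumed structural model: Gamma Y_t = Phi1 Y_{t-1} + eps_t
Gamma = np.array([[1.0, -0.5],
                  [0.4,  1.0]])
Phi1 = np.array([[0.3, 0.1],
                 [0.0, 0.2]])

Gamma_inv = np.linalg.inv(Gamma)
A1 = Gamma_inv @ Phi1            # reduced-form VAR(1) coefficient matrix

rng = np.random.default_rng(1)
T = 50
eps = rng.normal(size=(T, 2))    # structural disturbances
u = eps @ Gamma_inv.T            # reduced-form disturbances u_t = Gamma^{-1} eps_t

Y_sem = np.zeros((T, 2))
Y_var = np.zeros((T, 2))
for t in range(1, T):
    # Structural form: solve Gamma Y_t = Phi1 Y_{t-1} + eps_t for Y_t.
    Y_sem[t] = np.linalg.solve(Gamma, Phi1 @ Y_sem[t - 1] + eps[t])
    # Reduced form: Y_t = A1 Y_{t-1} + u_t.
    Y_var[t] = A1 @ Y_var[t - 1] + u[t]

max_gap = np.max(np.abs(Y_sem - Y_var))
print("largest difference between the two simulated paths:", max_gap)

# Premultiplying both structural matrices by any invertible M leaves A1 unchanged.
M = np.array([[2.0, 0.0],
              [1.0, 1.0]])
Gamma2, Phi2 = M @ Gamma, M @ Phi1
A1_alt = np.linalg.inv(Gamma2) @ Phi2
print("same reduced form:", np.allclose(A1, A1_alt))
```

In an actual SEM the normalization (unit diagonal) and the exclusion restrictions are what rule out such alternative pairs; the example ignores those restrictions on purpose.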


Example  7.32:  Interest  and  Bond  Rates  (continued) 

We continue our previous analysis of the changes in the AAA bond rate $\Delta y_t$
and in the three-month Treasury Bill rate $\Delta x_t$. In Example 7.26 we considered
a VAR model for these stationary data. Now we will discuss (i) the
motivation of a simultaneous model for these data, (ii) estimation by 2SLS,
and (iii) interpretation of the outcomes.

(i)  Motivation  of  a  simultaneous  model 

In Chapter 5 we considered the equation $\Delta y_t = \alpha + \beta \Delta x_t + \varepsilon_t$. We concluded
that the residuals of this equation are serially correlated (see Example 5.22
(p. 365)) and that the regressor $\Delta x_t$ is endogenous (see Example 5.33
(p. 414-16)). We add lagged variables to account for the serial correlation
and we add an equation for $\Delta x_t$ to account for the endogeneity of this
variable. We consider the following dynamic SEM, which contains a constant
and lagged endogenous variables (but no other exogenous variables) as
explanatory variables:

$$\Delta y_t = \alpha_1 + \gamma_{12} \Delta x_t + \beta_{11} \Delta y_{t-1} + \beta_{12} \Delta x_{t-1} + \varepsilon_{1t},$$

$$\Delta x_t = \alpha_2 + \gamma_{21} \Delta y_t + \beta_{21} \Delta y_{t-1} + \beta_{22} \Delta x_{t-1} + \varepsilon_{2t}.$$

(ii)  Estimation  by  2SLS 

We estimate the above equations by OLS (as would be appropriate if the two
ADL equations were not simultaneous) and by 2SLS, using the data over
the period from January 1990 to December 1999 (so that n = 120). Both
equations contain four parameters, and in 2SLS we use five instruments (the
constant and $\Delta y_{t-1}$, $\Delta y_{t-2}$, $\Delta x_{t-1}$, and $\Delta x_{t-2}$). The results are in Exhibit
7.38. The OLS estimates of $\gamma_{12}$ (in Panel 1) and $\gamma_{21}$ (in Panel 3) suggest
significant contemporaneous effects between $\Delta x_t$ and $\Delta y_t$ (the t-value is
4.08 in both panels). However, if this is the case, then the OLS estimators are biased and
the OLS standard errors are wrong. So the results of OLS cannot be trusted.
On the other hand, the 2SLS estimates show that the contemporaneous
effects are not at all significant ($\gamma_{12}$ has t = −0.03 in Panel 2 and $\gamma_{21}$ has
t = −0.08 in Panel 4). This suggests that the matrix $\Gamma$ of contemporaneous
coefficients can be taken to be diagonal, so that the model
reduces to a VAR model. In Example 7.26 we estimated VAR models for
these data (see Panels 2-4 of Exhibit 7.30).

(iii)  Interpretation  of  the  outcomes 

The result of this example is illustrative for many applications. The mutual
dependence between variables can often be modelled well in terms of a VAR,
where each variable depends on the past of all other variables but not on the
present values of these variables. An advantage of VAR models is that they do
not need the exclusion restrictions that are required in an SEM to satisfy the
order condition for each equation. However, for models with a large number
of endogenous variables, a VAR is not feasible anymore, as it contains too
many parameters. In such situations a dynamic SEM may be preferred,
provided that one can specify credible exclusion restrictions.

Panel 1: Dependent Variable: DAAA
Method: Least Squares
Sample: 1990:01 1999:12; Included observations: 120

Variable                   Coefficient   Std. Error   t-Statistic   Prob.
C                            -0.003077     0.013533     -0.227360   0.8205
DUS3MT ($\gamma_{12}$)        0.330442     0.080906      4.084257   0.0001
DUS3MT(-1)                   -0.126609     0.084199     -1.503688   0.1354
DAAA(-1)                      0.307879     0.090652      3.396253   0.0009

Panel 2: Dependent Variable: DAAA
Method: Two-Stage Least Squares
Sample: 1990:01 1999:12; Included observations: 120
Instrument list: C DAAA(-1) DAAA(-2) DUS3MT(-1) DUS3MT(-2)

Variable                   Coefficient   Std. Error   t-Statistic   Prob.
C                            -0.006983     0.015690     -0.445016   0.6571
DUS3MT ($\gamma_{12}$)       -0.013614     0.524615     -0.025951   0.9793
DUS3MT(-1)                   -0.033164     0.167148     -0.198409   0.8431
DAAA(-1)                      0.385929     0.152556      2.529756   0.0128

Panel 3: Dependent Variable: DUS3MT
Method: Least Squares
Sample: 1990:01 1999:12; Included observations: 120

Variable                   Coefficient   Std. Error   t-Statistic   Prob.
C                            -0.008754     0.014502     -0.603618   0.5473
DAAA ($\gamma_{21}$)          0.380471     0.093156      4.084257   0.0001
DAAA(-1)                      0.081194     0.101716      0.798248   0.4264
DUS3MT(-1)                    0.285623     0.087285      3.272314   0.0014

Panel 4: Dependent Variable: DUS3MT
Method: Two-Stage Least Squares
Sample: 1990:01 1999:12; Included observations: 120
Instrument list: C DAAA(-1) DAAA(-2) DUS3MT(-1) DUS3MT(-2)

Variable                   Coefficient   Std. Error   t-Statistic   Prob.
C                            -0.012350     0.021047     -0.586807   0.5585
DAAA ($\gamma_{21}$)         -0.146298     1.938574     -0.075467   0.9400
DAAA(-1)                      0.282863     0.749923      0.377190   0.7067
DUS3MT(-1)                    0.266205     0.121696      2.187460   0.0307

Exhibit 7.38 Interest and Bond Rates (Example 7.32)

Two estimates of the equation that explains the changes in the AAA bond rate in terms of the
changes in the three-month Treasury Bill rate and lagged values (OLS in Panel 1, 2SLS
in Panel 2), and two estimates of the equation that explains the changes in the three-month
Treasury Bill rate in terms of the changes in the AAA bond rate and lagged values (OLS in
Panel 3, 2SLS in Panel 4).
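To give a feel for why an unrestricted VAR becomes infeasible with many endogenous variables, the following small sketch counts the regression coefficients of a VAR(p) with m variables. The function name `var_param_count` and the chosen values of m and p are illustrative assumptions; the count ignores the parameters of the error covariance matrix.

```python
def var_param_count(m: int, p: int, constant: bool = True) -> int:
    """Regression coefficients in an unrestricted VAR(p) with m variables.

    Each of the m equations has m*p lag coefficients, plus a constant if requested.
    """
    return m * (m * p + (1 if constant else 0))

# Coefficient counts for a VAR(2) with a constant, for increasing m.
counts = {m: var_param_count(m, p=2) for m in (2, 5, 10, 20)}
for m, k in counts.items():
    print(f"m = {m:2d}: {k} coefficients")
```

The count grows with the square of m, whereas the length of the available time series usually does not grow with m at all.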

Exercises:  T:  7.11;  E:  7.26. 


7.7.5  Summary 

In this section we have considered econometric models for data sets that
consist of combined cross sections of time series variables.

•  In  SUR  models,  the  effects  of  the  explanatory  variables  on  the  dependent 
variable  are  different  for  all  units.  The  relation  between  different  units  is 
modelled  by  the  contemporaneous  correlation  between  the  error  terms. 
This  model  requires  that  the  number  of  units  in  the  data  set  does  not 
exceed  the  length  of  the  observed  time  series  per  unit. 

•  Panel  models  are  used  for  data  sets  with  a  large  number  of  units.  The 
effect  of  explanatory  variables  is  the  same  across  all  units,  and 
the  differences  between  units  are  modelled  by  the  constant  term.  In  the 
fixed  effects  model  each  unit  has  its  own  parameter  for  the  constant 
term;  in  the  random  effects  model  these  parameters  are  supposed  to  be 
drawn  from  an  underlying  population. 

•  The simultaneous equation model consists of a set of equations for a
number of endogenous variables that influence each other simultaneously.
Estimation requires that the equations are identified in the
sense that each equation excludes a sufficient number of variables.

•  Apart from differences in the regression equations, the above models are
each characterized by special structures of the covariance matrix of the
error terms. A general estimation method for models with such covariance
structures is (F)GLS. The actual computations can be simplified by
exploiting the specific error structure of SUR models, panel models, and
simultaneous equation models.
Summary, further reading,
and keywords


SUMMARY 

In this chapter we discussed the modelling of time series data. Many economic
time series display trending behaviour and sometimes also seasonal
fluctuations and structural breaks. These aspects of the data should be
modelled in a proper way to be able to draw reliable conclusions from
estimated time series models. In the year 2003, the Nobel prize in economics
was awarded to Engle and Granger, two pioneers in the econometric modelling
of trends and changing volatility in time series. For univariate time series
we discussed methods to model trends, seasonals, parameter variations, and
changing volatility. Further we considered the modelling of stationary time
series by means of ARMA models. Several of the diagnostic tests of Chapter 5
can be applied to investigate the empirical adequacy of estimated time series
models. We also considered time series models with exogenous variables and
multiple time series models. The proper modelling of trends is again of
crucial importance to obtain reliable results. If the variables contain stochastic
trends, then one can estimate a vector autoregressive model for the first
differences of the variables, unless they are cointegrated, in which case a
vector error correction model is more appropriate. Finally we paid attention
to data where the number of variables is large compared to the length of the
observation period. We discussed the SUR model and the method of generalized
least squares, models for panel data, and simultaneous equation models.


FURTHER  READING 

For further background on the topics of this chapter we provide some references.
The three volumes of the Handbook of Econometrics mentioned in Chapter 3,
Further Reading, contain chapters on panel data and simultaneous equation
models. The fourth volume in this series, edited by Engle and McFadden (1994),
contains chapters on many time series topics, including trends and unit roots, VAR
models and cointegration, and ARCH models. From the many textbooks on time
series we mention Brockwell and Davis (1997), Granger and Newbold (1986),
Franses (1998), Hamilton (1994), and Harvey (1991). A textbook with applications
is Patterson (2000), the theory of VAR models is described in Lütkepohl
(1991), and cointegration in Johansen (1995). The SUR model, panel data models,
and SEM are discussed in most of the econometric textbooks mentioned in
Chapter 3, Further Reading (p. 178-9), and Baltagi (1995) deals exclusively
with panel data.
Exercises


THEORY  QUESTIONS 

7.1* (Sections 7.1.2-7.1.4)

a. Let $\varepsilon_t = y_t - E[y_t \mid Y_{t-1}]$ be the innovations of a
stationary process that is jointly normally distributed.
Show that the process $\varepsilon_t$ can be written
as a linear function of the observations
$\{y_{t-k}, k \geq 0\}$ and that it has mean zero and constant
variance. Show that all the autocorrelations
$\rho_k$ ($k \neq 0$) of the process $\varepsilon_t$ are zero.

b. Show that an AR(p) process $\phi(L) y_t = \varepsilon_t$ (with
$\varepsilon_t$ the innovation process) is stationary if and
only if all the solutions of $\phi(z) = 0$ lie outside
the unit circle. (This was shown in Section 7.1.3
for p = 1; use this result to prove the statement
first for order p = 2 by the factorization
$\phi(z) = (1 - \alpha_1 z)(1 - \alpha_2 z)$, and then repeat this
idea for orders p > 2.)

c. Prove that an MA(q) process $y_t = \theta(L)\varepsilon_t$ is invertible
if and only if all the solutions of
$\theta(z) = 0$ lie outside the unit circle. (This was
shown in Section 7.1.4 for q = 1; use the factorization
idea of b to show this for q > 1.)

7.2 (Sections 7.1.5, 7.2.2, 7.3.4)

a. Show that the ACF of an MA(1) process with
parameter $\theta$ is the same as that of an MA(1)
process with parameter $1/\theta$. Discuss the relevance
of this finding for maximum likelihood estimation
of (AR)MA models.

b. Show that in a stationary AR(2) process
$y_t = \phi_1 y_{t-1} + \phi_2 y_{t-2} + \varepsilon_t$ there holds that
$\phi_2 \neq 1$, that $E[y_{t-k}\varepsilon_t] = 0$ for all k > 0, and
that $E[y_t \varepsilon_t] = \sigma^2$. Use these results to prove that
$\gamma_0 = \phi_1\gamma_1 + \phi_2\gamma_2 + \sigma^2$, $\gamma_1 = \phi_1\gamma_0 + \phi_2\gamma_1$, and
$\gamma_k = \phi_1\gamma_{k-1} + \phi_2\gamma_{k-2}$ for $k \geq 2$. Show that the
autocorrelations are given by $\rho_1 = \phi_1/(1 - \phi_2)$,
$\rho_2 = \phi_1^2/(1 - \phi_2) + \phi_2$, and $\rho_k = \phi_1\rho_{k-1} +
\phi_2\rho_{k-2}$ for $k \geq 3$.

c. Derive the first four autocorrelations of the (stationary
and invertible) ARMA(1,1) process
$y_t = \phi y_{t-1} + \varepsilon_t + \theta\varepsilon_{t-1}$.

d. Derive the ACF of $z_t = \Delta\Delta_4 y_t$ in the ‘airline’
model $z_t = (1 + \theta_1 L)(1 + \theta_4 L^4)\varepsilon_t$ and show that
$\rho_2 = 0$ and that $\rho_3 = \rho_5$. Derive the ACF if
$\theta_1 = \theta_4 = -1$.

7.3* (Sections 7.1.5, 7.2.2)

a. If $y_t$ is a stationary process with the property that
the autocorrelations cut off (so that $\rho_k = 0$ for
k > q), then show that $y_t$ can be written as an
MA(q) process. (The reverse statement was
proven in Section 7.1.5.)

b. Show that a stationary process $y_t$ can be written as
an AR(p) process if and only if the partial autocorrelations
cut off (so that $\phi_{kk} = 0$ for k > p).

c. Show that the regression in (7.12) provides the
partial autocorrelations by using the result of
Frisch-Waugh of Section 3.2.5.

d. If OLS is applied in an AR(p) model with constant
term, then the regressors are given by
$x_t' = (1, y_{t-1}, \ldots, y_{t-p})$. Show that, if the process
is stationary, the matrix of second order
moments $Q_n = \frac{1}{n}\sum_{t=p+1}^{n} x_t x_t'$ converges in probability
to a non-singular matrix.

7.4 (Section 7.1.6)

a. Derive formulas for the h-step-ahead forecasts of
an AR(1) process $y_t = \alpha + \phi y_{t-1} + \varepsilon_t$ and for the
corresponding forecast error variances, in terms
of the parameters $\phi$, $\alpha$, and $\sigma^2$.

b. Derive formulas for the 3- and 4-step-ahead forecasts
and corresponding forecast error variances
of an AR(p) process.

c. Derive formulas for the h-step-ahead forecasts
and forecast error variances ($h = 1, \ldots, 4$) of a
(stationary and invertible) ARMA(1, 1) process.

7.5 (Sections 7.3.2, 7.3.4)

a. Section 7.3.2 states the formula $\mathrm{SPE}(h) =
\sigma^2 \sum_{j=0}^{h-1} \bigl(\sum_{k=0}^{j} \psi_k\bigr)^2$ for the forecast error variance
of an ARIMA(p, 1, q) process $y_t$ in terms
of the representation $z_t = \sum_{k=0}^{\infty} \psi_k \varepsilon_{t-k}$ of the
stationary series $z_t = (1 - L)y_t$. Prove this result.

b. Show that the out-of-sample h-step-ahead forecasts
produced by EWMA are constant, independent
of the forecast horizon h.

c. Show that the Holt-Winters model with
changing level $\mu_t$ and slope $\alpha_t$ can be written as
an ARIMA(0, 2, 2) model by eliminating $\mu_t$ and
$\alpha_t$. Show that the h-step-ahead forecasts of this
model lie on a straight line.

d. Consider the model with changing level $\mu_t$ and
seasonal $s_t$ defined by $y_t = \mu_t + s_t + \varepsilon_t$, where
$\mu_{t+1} = \mu_t + \eta_t$ and $s_{t+1} = -s_t - s_{t-1} - s_{t-2} + \zeta_t$
and where all error terms are independently and
normally distributed white noise processes. Show
that $\Delta\Delta_4 y_t$ is an MA(5) process. Investigate
whether the series $y_t$ can be described by an
‘airline’ model.

7.6* (Section 7.4.3)

a. Let $\varepsilon_t$ follow an ARCH(1) process with conditional
variance $\sigma_t^2 = \alpha_0 + \alpha_1 \varepsilon_{t-1}^2$. Show that $\varepsilon_t$
has zero mean, that $\varepsilon_t$ is a white noise process
with (unconditional) variance $\alpha_0/(1 - \alpha_1)$, and
that $\varepsilon_t^2$ follows an AR(1) process.

b. Let $\varepsilon_t$ follow an ARCH(p) process. Show that $\varepsilon_t$ is
white noise and that $\varepsilon_t^2$ follows an AR(p) process.

c. Let $\varepsilon_t$ follow a GARCH(1, 1) process with conditional
variance $\sigma_t^2 = \alpha_0 + \alpha_1 \varepsilon_{t-1}^2 + \alpha_2 \sigma_{t-1}^2$. Show
that $\varepsilon_t$ is white noise and that $\varepsilon_t^2$ follows an
ARMA(1, 1) process.

d. Show that the process $\varepsilon_t^2$ of c is stationary if
$0 < \alpha_1 + \alpha_2 < 1$ but that it is integrated of order
1 if $\alpha_1 + \alpha_2 = 1$.

7.7* (Sections 7.4.3, 7.4.4)

In this exercise we derive the ARCH LM-test of
Section 7.4.4 for the null hypothesis of no ARCH
against the alternative of an ARCH(1) process. The
model for the observed time series $y_t$ is formulated
as $y_t \mid Y_{t-1} = \sigma_t \eta_t$, where $Y_{t-1} = \{y_{t-s}, s = 1, 2, \ldots\}$
(for simplicity of the analysis we assume that this
information extends infinitely far in the past),
$\sigma_t^2 = \alpha_0 + \alpha_1 y_{t-1}^2$, and $\eta_t$ is a series of independent
variables with the N(0, 1) distribution. It is given
that $\alpha_0 > 0$ and $0 < \alpha_1 < 1$. In Exercise 7.6 it was
shown that $y_t$ is a white noise process (so that
$E[y_t y_{t-k}] = 0$ for all $k \neq 0$) and that $y_t^2$ is an AR(1)
process.

a. Give a simple argument (without calculations) to
prove that the process $y_t$ cannot be normally
distributed. Next show that the probability distribution
of $y_t$ has kurtosis larger than 3.

b. The log-likelihood is given by (7.31) with
$\alpha = \phi = 0$. Derive the first and second order derivatives
of this function with respect to $\alpha_0$ and $\alpha_1$.

c. Use the results in b to compute the LM-test for
the null hypothesis of conditional homoskedasticity
($\alpha_1 = 0$) by means of the general formula
(4.54) in Section 4.3.6 for the LM-test.

d. Show that this test can be computed as $LM = nR^2$
of the regression of $y_t^2$ on a constant and $y_{t-1}^2$.

e. Prove the validity of the following method for the
ARCH LM-test for the error terms in the regression
model $y_t = x_t'\beta + \varepsilon_t$. First regress $y_t$ on $x_t$
with residuals $e_t$, then regress $e_t^2$ on a constant
and $e_{t-1}^2$ and let $LM = nR^2$ of this regression. (It
may be helpful to prove first that the information
matrix for the parameters $(\alpha_0, \alpha_1, \beta')'$, evaluated
at the restricted ML estimates, is block-diagonal.)

f. Let $y_t = x_t'\beta + \varepsilon_t$, where $\varepsilon_t$ follows an ARCH(1)
process. Use the previous results to show that
OLS is the best linear unbiased estimator of $\beta$
but that it is not (asymptotically) efficient.
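The auxiliary regression of part d can be illustrated on simulated data. The sketch below is an illustration under stated assumptions rather than a solution of the exercise: the helper name `arch_lm`, the sample size, and the parameter values $\alpha_0 = 1$ and $\alpha_1 = 0.5$ are all chosen for the example. It computes $nR^2$ for a simulated ARCH(1) series and for an independent N(0, 1) series, which may be compared with the 5% critical value 3.84 of the $\chi^2(1)$ distribution.

```python
import numpy as np

def arch_lm(e):
    """ARCH(1) LM statistic: n * R^2 from regressing e_t^2 on a constant and e_{t-1}^2."""
    e2 = np.asarray(e, dtype=float) ** 2
    y, x = e2[1:], e2[:-1]
    X = np.column_stack([np.ones_like(x), x])
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ coef
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return len(y) * r2

rng = np.random.default_rng(42)
n = 2000
eta = rng.normal(size=n)

# Simulated ARCH(1) series with assumed parameters alpha0 = 1 and alpha1 = 0.5.
y_arch = np.zeros(n)
for t in range(1, n):
    sigma2_t = 1.0 + 0.5 * y_arch[t - 1] ** 2
    y_arch[t] = np.sqrt(sigma2_t) * eta[t]

lm_arch = arch_lm(y_arch)                 # expected to be far above 3.84
lm_iid = arch_lm(rng.normal(size=n))      # behaves roughly like a chi-squared(1) draw

print("LM for the ARCH(1) series:", lm_arch)
print("LM for the i.i.d. series: ", lm_iid)
```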

7.8 (Section 7.4.4)

Consider the so-called bilinear process
$y_t = \tfrac{1}{2}\varepsilon_{t-1} y_{t-2} + \varepsilon_t$, where $\varepsilon_t$ are independent drawings
from N(0, $\sigma^2$). As starting conditions are given
$\varepsilon_0 = 0$ and $y_{-1} = y_0 = 0$.

a. Prove that $y_t$ is an uncorrelated process. Is it also
a white noise process?

b. Prove that $y_t^2$ is not an uncorrelated process.

c. Prove that $y_t$ cannot be forecasted by linear functions
of past observations $y_{t-k}$ ($k \geq 1$) but that it
can be forecasted by non-linear functions of
these past observations.

d. Simulate n = 200 data from this process. Perform
a Ljung-Box test and an ARCH test on the
resulting time series. What is the relevance of
this result for the interpretation of ARCH tests?

7.9 (Sections 7.5.1, 7.6.1)

a.  Rewrite  the  ADL  model  (7.32)  (with  (4  =  0  for  all 
k)  in  error  correction  form  (7.34).  It  is  helpful  to 
consider the AR polynomial $\phi(z) = 1 - \sum_{k=1}^{p} \phi_k z^k$
and to prove that $\phi(z) - \phi(1)z = (1 - z)\phi^*(z)$,
where $\phi^*(z)$ is a polynomial of degree p − 1 with
$\phi^*(0) = 1$.

b. Show that the coefficient $-\phi(1)$ of the ‘error’
$(y_{t-1} - \lambda x_{t-1} - \delta)$ in (7.34) is negative if the AR
polynomial $\phi(z)$ is stationary (has all its roots
outside the unit circle). Discuss why this is
needed for error ‘correction’.

c. In Section 7.6.1 we derived the general VECM
(7.38). Compute this representation explicitly for
a VAR(2) model.

d. Explain that stationarity of the VAR polynomial
matrix $\Phi(z)$ corresponds to correction in the direction
of equilibrium in the sense that deviations
$(Y_{t-1} - \mu) \neq 0$ in (7.38) die out in the long run.
Give an explicit proof of this fact for the VECM
(7.38) with p = 1 (so that the terms $\Gamma_j \Delta Y_{t-j}$ drop
out of this equation).

7.10 (Sections 7.6.2, 7.7.2)

a. Prove that GLS in the SUR model (7.41) with
$\sigma_{ij} = 0$ for all $i \neq j$ (so that $\Omega$ is diagonal) boils
down to OLS per unit.

b. Prove this also in case $\Omega$ is non-diagonal but the
regressor matrix $X_i = X$ is constant across all
units $i = 1, \ldots, m$. For simplicity consider only
the case of m = 2 units.

c. If a or b holds true, then explain why OLS is
more efficient than FGLS for finite samples (and
equally efficient asymptotically).

d. Write the VAR(p) model with m variables in
terms of the SUR model, that is, with separate
equations for each of the m variables
$y_{it}$, $i = 1, \ldots, m$, $t = p + 1, \ldots, n$, and ordered
as in (7.41). What is the structure of the
$m(n - p) \times m(n - p)$ covariance matrix of the
corresponding $m(n - p) \times 1$ vector of error
terms? Explain that ML in the VAR(p) model is
equivalent to applying OLS per equation.

7.11 (Section 7.7.4)

Consider the simultaneous equation model

$$y_{1t} = \gamma_{12} y_{2t} + \varepsilon_{1t},$$

$$y_{2t} = \gamma_{21} y_{1t} + \beta_{21} z_{1t} + \beta_{22} z_{2t} + \varepsilon_{2t},$$

where $y_{1t}$ and $y_{2t}$ are endogenous variables and $z_{1t}$
and $z_{2t}$ are exogenous variables. The disturbances $\varepsilon_{1t}$
and $\varepsilon_{2t}$ are jointly normally distributed with mean
zero and 2 × 2 covariance matrix $\Omega$ and they are
uncorrelated over time. For n = 100 observations,
the following matrix of sample moments around
zero is given (for instance, $\frac{1}{n}\sum_{t=1}^{n} y_{2t} z_{1t} = 4$).

Variable    y1    y2    z1    z2
y1          10    20     2     3
y2          20    50     4     8
z1           2     4     5     5
z2           3     8     5    10

a. Prove that the first equation of this SEM satisfies
the order condition but that the second equation does not.

b. Compute the 2SLS estimate of $\gamma_{12}$.

c. Compute the large sample standard error of this
estimate, using formula (5.76) of Section 5.7.2 for
the asymptotic distribution of the 2SLS estimator
and replacing the error variance $\sigma_{11} = E[\varepsilon_{1t}^2]$ by
the estimate $\hat\sigma_{11} = \frac{1}{n}\sum_{t=1}^{n} (y_{1t} - \hat\gamma_{12} y_{2t})^2$.


EMPIRICAL  AND  SIMULATION  QUESTIONS 

7.12 (Sections 7.2.2, 7.3.3)

In this exercise we consider the properties of the OLS
estimator $\hat\phi$ in the AR(1) model $y_t = \phi y_{t-1} + \varepsilon_t$ (discussed
in Section 7.2.2) in some more detail. As a
formal statistical analysis is involved, we perform a
simulation study with $\phi = 0.9$ and $\sigma^2 = 1$.

a. Generate a sample of length n = 1000 of this
process, with starting condition $y_0 = 0$ and with
$\varepsilon_t$ independent drawings from N(0, 1).

b. Compare the sample mean of $y_t$ and of $y_t^2$ with
the theoretical mean and variance of the AR(1)
process.

c. Define the process $z_t = y_{t-1}\varepsilon_t$. Test whether there
exists significant serial correlation in the process
$z_t$.

d. Repeat the simulation experiment in a 1000 times.
For each simulation, compute $\frac{1}{\sqrt{n}}\sum_t y_{t-1}\varepsilon_t$ and
$\sqrt{n}(\hat\phi - \phi)$.

e. Compare the sample distributions of the two
statistics in d with the distributions
N(0, $\sigma^4/(1 - \phi^2)$) and N(0, $1 - \phi^2$) respectively
that were mentioned in Section 7.2.2 (see (7.16)).

f. Repeat the 1000 simulations in d with $\phi = 1$
instead of $\phi = 0.9$. Compare the sample distributions
of the two statistics in d with the normal
distributions mentioned in e. Explain the outcomes.

7.13 (Sections 7.3.1, 7.3.4)

a. Generate a sample of size n = 400 from the deterministic
trend model $y_t = 1 + t + \varepsilon_t$, where $\varepsilon_t$
is normally distributed white noise with mean 0
and variance $\sigma^2 = 2500$. Plot the time series $y_t$,
the correlogram of $y_t$, and the scatter diagram of
$y_t$ against $y_{t-1}$. Plot also the differenced series
$\Delta y_t$, the correlogram of $\Delta y_t$, and the scatter diagram
of $\Delta y_t$ against $\Delta y_{t-1}$.

b. Generate a sample of size n = 400 from the stochastic
trend model $y_t = y_{t-1} + \varepsilon_t$, where $y_0 = 0$
and $\varepsilon_t$ is normally distributed white noise. Make
the same plots (for $y_t$ and for $\Delta y_t$) as in a and
compare the outcomes with those in a.

c. Suppose that $y_t$ follows a stationary and invertible
ARMA process. Show that the differenced
series $x_t = (1 - L)y_t$ follows an ARMA process
that is not invertible. Show also that the autocorrelations
of $x_t$ have sum $\sum_{k=1}^{\infty} \rho_k = -1/2$.

d. Illustrate the result in c by means of a suitable
simulation experiment.

7.14 (Sections 7.3.3, 7.4.1)

In this exercise we simulate data from the model
$\Delta y_t = \alpha + \rho y_{t-1} + \varepsilon_t$, where $y_0 = 0$ and where $\varepsilon_t$
are independent drawings from N(0, 1).

a. Simulate a series of length n = 100 from the
model with parameters $(\alpha, \rho) = (0, 0)$. Estimate
$\rho$ by OLS (including a constant in the equation)
and compute the t-value of $\rho$.

b. Repeat a 10,000 times and make histograms of
the resulting 10,000 estimates of $\rho$ and of the
t-values of $\rho$. Determine the left 5% quantile of
these t-values and compare the outcome with the
corresponding critical value in Exhibit 7.16.

c. Repeat b, for the DGP with $\alpha = 0.5$ and $\rho = 0$ (so
that $\Delta y_t = 0.5 + \varepsilon_t$) and with test equation
$\Delta y_t = \alpha + \beta t + \rho y_{t-1} + \varepsilon_t$ (including constant
and trend term in the test equation).

d. For each of the 10,000 simulated data sets in c,
estimate $\rho$ and compute the t-value of $\rho$ by regression
in the (wrongly specified) model $\Delta y_t =
\rho y_{t-1} + \varepsilon_t$ (that is, excluding both the constant
and the trend term in the test equation). Determine
the left 5% quantile of the resulting 10,000
t-values. Explain the relation of the outcomes
with the possible dangers of misspecification of
the Dickey-Fuller test equation.

e. Contaminate the 10,000 series of b by a single
additive outlier of size 20 at time t = 50. How
many times does the Dickey-Fuller test (with
constant included) now reject the null hypothesis
of a unit root (using a significance level of 5%)?
Give an intuitive explanation of this difference
with the outcomes in b.

f. Now contaminate the 10,000 series of b by an
innovation outlier of size 20 at time t = 50.
Answer the same questions as in e.

g. Generate 10,000 series of the model with parameters
$(\alpha, \rho) = (1, -0.1)$ and with a single innovation
outlier of size 20 at time t = 50. Answer the
same questions as in e.

h. Illustrate, by means of a simulation, that an innovation
outlier in a random walk model generates
a time series with a permanent level shift.

7.15 (Sections 7.6.1-7.6.3)

Consider the following VAR(2) model in the two
variables $x_t$ and $y_t$:

$$\begin{pmatrix} x_t \\ y_t \end{pmatrix} = \begin{pmatrix} 1 \\ 0 \end{pmatrix} + \begin{pmatrix} 0.5 & 0.1 \\ 0.4 & 0.5 \end{pmatrix} \begin{pmatrix} x_{t-1} \\ y_{t-1} \end{pmatrix} + \begin{pmatrix} 0 & 0 \\ 0.25 & 0 \end{pmatrix} \begin{pmatrix} x_{t-2} \\ y_{t-2} \end{pmatrix} + \begin{pmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \end{pmatrix},$$

where $(\varepsilon_{1t}, \varepsilon_{2t})' \sim \mathrm{NID}(0, I)$ with I the 2 × 2 identity
matrix.

a. Show that this process is stationary and compute
the mean values of the two variables.

b. Simulate a series of length 105 from this VAR
model by taking as starting values $x_t = y_t = 3$ for
t = 1, 2 and computing the values for $t \geq 3$ by
means of the two VAR equations and simulated
values for the two independent white noise processes
$\varepsilon_{1t}$ and $\varepsilon_{2t}$.

c. Estimate a VAR(2) model and also a VAR(1)
model, on the basis of the first 100 observations.

d. Compare the two models of c by means of an
LR-test on the significance of second order lags
and also by means of AIC and SIC. Compare also
the forecast performance for the forecast period
$t = 101, \ldots, 105$.

e. Perform an appropriate Johansen cointegration
test and give an interpretation of the outcomes.

7.16 (Sections 7.3.4, 7.4.1)

In this exercise we consider some alternative
models for the series $y_t$ of US industrial
production (in logarithms) discussed
in Example 7.16. For estimation use again the data
over the period 1961.1 to 1994.4.

a. Estimate AR(p) models with deterministic seasonal
dummies for the series $\Delta y_t$, for orders
$p = 0, 1, \ldots, 8$. Which model is preferred by
the AIC and SIC criteria? For the AR(5) model,
perform tests on normality and serial correlation
of the residuals.

b. Estimate an ‘airline’ model, that is, estimate
the MA parameters in the model $\Delta\Delta_4 y_t =
(1 + \theta_1 L)(1 + \theta_4 L^4)\varepsilon_t$. Perform tests on normality
and serial correlation of the residuals. Check
whether the series $\Delta\Delta_4 y_t$ is possibly overdifferenced.

c. Compute sequential one-step-ahead forecasts
and also h-step-ahead forecasts (with
$h = 1, \ldots, 15$) for the variable $y_t$ over the period
1995.1 to 1998.3, for the AR(5) model of a and
for the airline model of b.

d. In Section 7.2.2 an AR(2) model was estimated
for the series $\Delta_4 y_t$. Use this model to compute
sequential one-step-ahead forecasts and also
h-step-ahead forecasts (with $h = 1, \ldots, 15$)
for the variable $y_t$ over the period 1995.1 to
1998.3.

e. Compare the forecast performance of the three
models in c and d. Which model do you prefer?

f. In Section 7.4.1 an AR(2) model for $\Delta_4 y_t$ was
estimated with seven dummies. Now add two
dummies (one for 1961.3 and one for 1975.2)
and estimate the corresponding AR(2) model
with nine dummies. Test whether the two groups
of three sequential outliers in the periods
1961.1-1961.3 and 1974.4-1975.2 can be modelled
by means of two additive outliers (this gives
in total four parameter restrictions on the nine
dummy variables).


7.17 (Sections 7.2.1, 7.3.2-7.3.4, 7.6.3)

The data file contains monthly production data of nine Japanese passenger
car industries over the period 1980.1-2001.3. The data are taken from
‘DataStream’. In this exercise we consider the largest industry, Toyota, and
we denote the corresponding time series by $y_t$. The data over the period
1980.1-1999.12 should be used in estimation and diagnostic testing; the
remaining observations are used to evaluate the forecast performance of
models.

a. What conclusions do you draw from the time plot, the sample
autocorrelations, and the sample partial autocorrelations of $y_t$? Argue
why it does not seem necessary to take logarithms of this series.

b. Estimate the trend of this series by means of the Holt-Winters method.
What are the estimated values of the level $\mu_t$ and the slope $\alpha_t$
in December 1999? Forecast the production of Toyota for the twelve months
in 2000.

c. Perform an augmented Dickey-Fuller test for this series (include four
lags, and motivate your choices concerning constant and trend term).
Generate the residual series (with name ‘resdf4’) of the corresponding test
equation.

d. In the Dickey-Fuller test in c you made use of critical values. What
assumptions on the series ‘resdf4’ are needed to use these critical values?
Which of these assumptions is clearly violated? Make use of the correlogram
of ‘resdf4’.

e. Regress $y_t$ on a constant and eleven seasonal dummies. Generate the
forecasted values (the seasonal components) of this model for the year 2000
(for later use in g). Generate also the residual series (with name
‘restoy’) and perform an ADF test on ‘restoy’. Show that this test suffers
less from the problems mentioned in d.

f. Follow the methodology of Section 7.2.1 to construct an ARIMA model for
the series ‘restoy’. Perform diagnostic tests on your favourite model.

g. Use the model of f and the estimated seasonal components in e to construct
(dynamic) forecasts of the production of Toyota for the twelve months in
2000. Compare the quality of the forecasts with those obtained in b.

h. Let $x_t$ be the monthly series of Japanese passenger car production,
excluding Toyota. Test whether the series $x_t$ and $y_t$ are cointegrated,
and provide an economic interpretation of the results.

i. The common movements in the series $x_t$ and $y_t$ could be caused solely
by monthly patterns in production. Regress $x_t$ on a constant and eleven
seasonal dummies and let ‘restot’ be the residuals of this regression. Test
whether the deseasonalized series ‘restoy’ and ‘restot’ are cointegrated,
and compare the result with that obtained in h.

7.18 (Sections 7.2.1, 7.3.2, 7.3.4, 7.4.4)

The data file contains monthly energy production data of the USA for
different sorts of energy. The data are taken from ‘Economagic’. Here we
consider the series $y_t$ of nuclear electric power generation with data
from 1973.1 to 1999.11.

a. Plot the series over the full sample and also over the sample
1990.1-1999.11. What conclusions do you draw from this?

b. Regress the series $\log(y_t)$ on a constant, a deterministic trend, and
11 seasonal dummies, using the data over the period 1990.1-1998.12. Give an
interpretation of the estimates of the seasonal effects.

c. Follow the method of Section 7.2.1 to construct an ARMA model for the
series $e_t$ of residuals of the model in b. Perform tests on normality,
autocorrelation, ARCH effects, and parameter breaks on your favourite model.

d. Combine the models in b and c, that is, estimate the model of b including
error terms that follow your ARMA model of c. Perform the tests of c also
on this model.

e. Use the model of d to forecast the series $\log(y_t)$ over 1999.

f. Until now we have used a deterministic trend and seasonal for the series.
Perform an appropriate test for this assumption. Estimate also a model for
$\Delta_{12} \log(y_t)$ (that is, with stochastic seasonal), including
(seasonal and non-seasonal) AR and MA terms in the model and using the
observations over 1990.1-1998.12.

g. Use the model with stochastic seasonal in f to forecast the series
$\log(y_t)$ over 1999 and compare the forecast quality of this model with
that of the deterministic trend model in e.


7.19 (Section 7.4.4)

Consider the monthly series of the three-month Treasury Bill rate $r_t$ of
Example 7.25 (in levels, as in Section 7.5.3) with the model
$r_t - r_{t-1} = \beta_1 + \beta_2 r_{t-1} + \varepsilon_t$. In this exercise
we consider ARCH models (7.29) for the error term $\varepsilon_t$. The data
file contains monthly observations over the period 1950.1-1999.12, which are
taken from ‘Economagic’.

a. Estimate $\beta_1$ and $\beta_2$ by OLS and perform an ARCH test on the
residuals. Also plot the correlogram of the residuals $e_t$ and of the
squared series $e_t^2$.

b. Estimate the parameters of ARCH models (7.29) of orders
$p = 1, \ldots, 4$, by regressing $e_t^2$ on a constant and
$e_{t-1}^2, \ldots, e_{t-p}^2$. Which order of $p$ do you prefer?

c. Now estimate $\beta_1$ and $\beta_2$ in the model with ARCH($p$) error
terms by maximum likelihood, for $p = 1, \ldots, 4$. Compare the estimates
of $\beta_1$, $\beta_2$, and of $\alpha_k$ in (7.29) with those obtained in a
and b.

d. Perform tests to choose the order $p$ of the ARCH model, based on the
outcomes in c.

e. Construct the series of estimated variances $\hat{\sigma}_t^2$ obtained
from (7.29) by substituting the estimates of the parameters $\alpha_k$ of
the preferred model in d and by replacing the terms $\varepsilon_{t-k}^2$ by
$(e^*_{t-k})^2$, where $e^*_t$ is the series of residuals of the preferred
model in d. Compare the series of $\hat{\sigma}_t^2$ with the series
$(e^*_t)^2$ to evaluate the quality of the forecasted risks in interest rate
changes.

f. Perform a test on the presence of remaining ARCH effects and a test on
normality of the standardized residuals $e^*_t / \hat{\sigma}_t$. What
conclusions do you draw from these outcomes?

7.20 (Sections 7.2.4, 7.3.2, 7.3.3, 7.6.3)

The data file contains yearly data on the gross national product (GNP) for a
number of countries. We consider the series $y_t$ consisting of the natural
logarithm of US GNP over the period 1870-1993. The data are taken from
A. Maddison, Monitoring the World Economy 1820-1992 (OECD, 1995).

a. Investigate the nature of the trend in the series $y_t$ over the full
sample period, by means of ADF tests (both $t$ and $F$).

b. Investigate the nature of the trend also over the three subperiods
1870-1929, 1900-49, and 1950-93.

c. Use the data over the period 1950-89 to estimate the yearly growth rate of
US GNP by means of two simple models, the deterministic trend model
$y_t = \alpha + \beta t + \varepsilon_t$ and the stochastic trend model
$y_t = y_{t-1} + \alpha + \varepsilon_t$.

d. Use the two models of c to forecast $y_t$ for the period 1990-3 and
compare the forecast quality.

e. Try to improve on the forecast results of d by adding ARMA terms for the
short-run fluctuations to the trend models for $y_t$. The model selection
should be based on the data over the period 1950-89.

f. The data file also contains GNP data for Germany, Japan, and the UK.
Investigate the presence of cointegration between the four GNP series (in
logarithms), both for the period 1950-93 and for the period 1870-1993.
Motivate your choice of cointegration test and comment on the outcomes.

7.21 (Sections 7.3.3, 7.5.2, 7.6.3)

In this exercise we consider yearly data on gasoline consumption in the USA
over the period 1970-99. The data file contains data on gasoline consumption
(GC), gasoline price (PG), and disposable income (RI), all measured in real
terms and taken in logarithms. These data were previously discussed in
Examples 5.31 and 5.34 (p. 402-4, 416-18). In Example 5.31 we considered the
regression $GC_t = \alpha + \beta PG_t + \gamma RI_t + \varepsilon_t$, and now
we will consider whether this regression is possibly spurious due to trends
and whether lags should be added to this model. Use a significance level of
5% in all tests of this exercise.

a. Regress GC on a constant and the variables PG and RI.

b. Test for the presence of stochastic trends in the variables GC, PG, and
RI. Include a constant and deterministic trend in the test equations.

c. Test for the presence of cointegration between the three variables.
Compare the price elasticity $b$ of the estimated cointegration relation
$GC_t = a + b\,PG_t + c\,RI_t + dt$ with the estimate of a.

d. Test for the presence of residual correlation in the model of a. Estimate
two ADL models, including $p = 1$ or $p = 2$ lags of the variables $GC_t$,
$PG_t$, and $RI_t$ as additional regressors.

e. Which of the three models of a and d is preferred on the basis of LR-tests
on the significance of the additional lagged terms? And which model is
preferred by SIC? Test the selected model(s) for the presence of residual
correlation.

f. Compute the long-run elasticities of price and income of the preferred ADL
model(s) of e and rewrite this in error correction form. Give an
interpretation of the outcome.

7.22 (Section 7.4.4)

The data file contains monthly data for the UK on the returns in the sector
of cyclical consumer goods (denoted by $y_t$) and in the market (denoted by
$x_t$) over the period 1980.01-2000.03. The CAPM postulates a linear relation
between the returns, that is, $y_t = \alpha + \beta x_t + \varepsilon_t$. The
data are taken from ‘DataStream’ and were analysed previously in Examples
5.27 and 5.28 (p. 384-6, 387-8).

a. Estimate the CAPM, using data only over the period 1980.01-1999.12.
Investigate the series of residuals $e_t$, in particular a time plot and the
correlograms of $e_t$ and of the squared series $e_t^2$.

b. Perform tests to show that $e_t$ has no serial correlation but that it has
significant ARCH. Estimate an ARMA model for the squared residuals $e_t^2$,
with orders based on the test outcomes.

c. Obtain a new estimate of the CAPM
$y_t = \alpha + \beta x_t + \varepsilon_t$ together with the GARCH model of b
for the error terms $\varepsilon_t$. That is, estimate this combined model
by ML (instead of the two-step approach in a and b). Use again the data
over 1980.01-1999.12.

d. Test for normality, serial correlation, and ARCH effects in the
(standardized) residuals of the model in c. Compare the outcome of the
Jarque-Bera test on normality with the test results in Section 5.6.3
(p. 387-8) for the CAPM without GARCH, and explain the differences.

e. Compare the estimates of c with those obtained in a and b. Comment on the
similarities and differences. Make scatter diagrams of the returns $y_t$
against the predicted values $\hat{y}_t$ and of the risk $e_t^2$ against the
estimated variance $\hat{\sigma}_t^2$ (the estimated value of the variance
$\sigma_t^2 = E[\varepsilon_t^2]$ obtained from the model).

f. Use the model of c to forecast $y_t$ and $\sigma_t^2 = E[\varepsilon_t^2]$
for the first three months in 2000 (use the actual values of $x_t$ in these
months). Use this model also to construct 95% forecast intervals and compare
the outcomes with the actual values of $y_t$ in these months.

7.23 (Sections 7.3.3, 7.4.1, 7.4.4, 7.6.3)

The data file contains monthly data on price levels and exchange rates for a
number of countries. The data are taken from ‘International Financial
Statistics’. We consider the data for Germany and the UK and we denote the
consumer price indices by $P_G$ and $P_{UK}$, the exchange rate of the
Deutsche Mark to 1 US Dollar by $X_G$ and the exchange rate of the British
Pound to 1 US Dollar by $X_{UK}$. The nominal exchange rate of the Mark
against 1 Pound is equal to $X_G/X_{UK}$ and the relative price level of
Germany against the UK is $P_G/P_{UK}$. The Purchasing Power Parity (PPP)
hypothesis of international economics states that
$X_G/P_G = X_{UK}/P_{UK}$, or equivalently that
$X_G/X_{UK} = P_G/P_{UK}$. In the following questions we consider data for
the four series $P_G$, $P_{UK}$, $X_G$ and $X_{UK}$ over the period
1975.1-1994.12. The PPP hypothesis is usually written as
$\log(X_G/X_{UK}) = \log(P_G) - \log(P_{UK})$. Use a significance level of
5% in all tests of this exercise.

a. Make a plot of the two price series $P_G$ and $P_{UK}$ and also of the two
exchange rate series $X_G$ and $X_{UK}$.

b. Test for the presence of unit roots in the four series (all in logarithms)
and test also for the presence of cointegration.

c. Perform the two tests of b also for the set of three series
$\log(X_G/X_{UK})$, $\log(P_G)$, and $\log(P_{UK})$. For reasons of economic
interpretation, do not include a trend in the cointegration equation.

d. The PPP can be formulated in econometric terms by the hypothesis that
$\log(X_G/X_{UK}) - \log(P_G) + \log(P_{UK})$ should be a stationary time
series. Test this hypothesis, compare the outcome with the one obtained in
c, and give an economic interpretation.

e. Let $y_t = \Delta \log(X_G/X_{UK})$ be the series of monthly relative
changes in the exchange rate between Germany and the UK. Perform tests for
autocorrelation and for ARCH in the series $y_t$.

f. Test for the presence of outliers in the series $y_t$ of e. Discuss the
possible relevance of this for the analysis of PPP in d.

7.24 (Sections 7.3.3, 7.6.3)

The data file contains yearly data on the Standard and Poor index $y_t$ and
dividends $x_t$, both in real terms, over the period 1871-1987. The data are
taken from R. J. Shiller, Market Volatility (MIT Press, 1989). The Present
Value theory of financial economics states that the stock price $y_t$ is
determined by the expected future dividends. Let $\delta$ be the discount
factor; then this can be formulated as
$y_t = \sum_{i=1}^{\infty} \delta^i E_t[x_{t+i}]$. If the expected dividends
would be constant so that $x_{t+i} = x$ then the corresponding equilibrium
value of $y_t$ is
$y = \sum_{i=1}^{\infty} \delta^i x = \frac{\delta}{1-\delta}\,x$. A further
finding in financial economics is that stock prices often follow random
walks, in which case the equilibrium relation
$y = \frac{\delta}{1-\delta}\,x$ corresponds to the presence of
cointegration between the series $y_t$ and $x_t$.

a. Make a time plot of the series $y_t$ and $x_t$. Test for the presence of
unit roots in both series by means of appropriate ADF tests.

b. Test for the presence of cointegration between the series $y_t$ and $x_t$.
Include a deterministic trend in the cointegration relation.

c. Estimate a vector error correction model for the series
$Y_t = (y_t, x_t)'$. Include an appropriate number of lagged terms
$\Delta Y_{t-k}$ and take as error correction term
$(y_{t-1} - \theta x_{t-1} - \alpha - \beta t)$.

d. Give an interpretation of the adjustment coefficients of the VECM in c.
Give also an interpretation of the estimated parameter $\theta$ by computing
the corresponding value of the discount factor $\delta$ in the Present Value
model.


7.25 (Sections 7.7.2, 7.7.3)

The data file contains quarterly fashion sales data of a US retailer with
multiple specialty divisions over the period 1986.1-1992.4. This is a panel
data set with $m = 5$ units and $n = 28$ observations. The first two
divisions specialize in high-priced fashion apparel, division 3 in
low-priced clothes, and divisions 4 and 5 in specialities like large sizes,
undergarments, and so on (the data of division 1 were previously analysed in
Example 5.6 (p. 305-7)). The data consist of the quarterly sales $S_{it}$ of
the divisions, the purchasing ability $A_t$, and the consumer confidence
$C_t$. Motivated by the results in Section 5.3.1 we formulate the model

$$
\log(S_{it}) = \alpha_i + \alpha_{i2} D_{2t} + \alpha_{i3} D_{3t}
  + \alpha_{i4} D_{4t} + \beta_i \log(A_t) + \gamma_i \log(C_t)
  + \varepsilon_{it},
$$

where $i = 1, \ldots, 5$ denotes the division and $t = 1, \ldots, 28$ the
observation number and where $D_{jt}$ are seasonal dummies ($j = 2, 3, 4$,
the first quarter is taken as the reference season).

a. Estimate the above model (with thirty regression parameters) by OLS. Check
that there exists significant contemporaneous correlation between the
residual terms for the five divisions in the same quarter.

b. Estimate the model also by SUR. Compare the outcomes with a, and explain.

c. Estimate a panel model with fixed effects and with different parameters
for the seasonal dummies. This means that the parameters $\beta_i$ and
$\gamma_i$ are constant across the divisions, so that the model contains in
total twenty-two parameters. Compare the results with a and b.

d. Estimate the model of c with the restriction of equal seasonal effects
$\alpha_{ij}$ across the divisions but with different fixed effects
$\alpha_i$, so that the model contains in total ten regression parameters.
Compare the estimated seasonal effects with the estimates in c.

e. Perform an LR-test of the twelve parameter restrictions that reduce the
model in c to the model in d.

f. Explain why it would make little sense to estimate a random effects panel
model for these data.


7.26 (Section 7.7.4)

The data file contains fifty yearly data on the market for oranges in the
USA over the period 1910-59. The data are taken from M. Nerlove and
F. V. Waugh, ‘Advertising without Supply Control: Some Implications of a
Study of the Advertising of Oranges’, Journal of Farm Economics, 43 (1961),
813-37. The variables are the quantity traded ($Q$), the price ($P$), real
disposable income ($RI$), current advertisement expenditures ($AC$), and
past advertisement expenditures ($AP$, averaged over the past ten years).
First we assume that the supply $Q$ is fixed and that the price is
determined by demand via the price equation

$$
\log(P_t) = \alpha + \gamma \log(Q_t) + \beta \log(RI_t) + \varepsilon_t.
$$

a. Estimate the price equation by OLS. Test the null hypothesis of unit price
elasticity ($\gamma = -1$).

b. Estimate the price equation also by IV, using as instruments a constant,
$\log(RI_t)$, $\log(AC_t)$, and $\log(AP_t)$. Test again the null hypothesis
of unit elasticity.

c. Perform the Hausman test for the exogeneity of $\log(Q_t)$ in the price
equation.

d. Investigate the quality of the instruments, that is, whether they are
sufficiently correlated with $\log(Q_t)$ and uncorrelated with the price
shocks $\varepsilon_t$ (take the IV residuals as estimates of these shocks).

e. Answer questions b, c, and d also for the $n = 45$ observations obtained
by excluding the data over the period 1942-6.

Next we consider the simultaneous model for price and quantity described by
the following two equations. We exclude the two advertisement variables (AC
and AP) from the analysis in f and g.

$$
\begin{aligned}
\text{(demand)}\quad \log(P_t) &= \alpha_1 + \gamma_1 \log(Q_t)
  + \beta_1 \log(RI_t) + \varepsilon_{1t}, \\
\text{(supply)}\quad \log(P_t) &= \alpha_2 + \gamma_2 \log(Q_t)
  + \varepsilon_{2t}.
\end{aligned}
$$

f. Is the demand equation identified? Estimate this equation by OLS. What is
your interpretation of the outcomes?

g. Is the supply equation identified? Estimate this equation by a method that
you find most appropriate, and motivate your choice.




Appendix  A.  Matrix  Methods 


In this appendix we summarize some matrix methods and some results on
functions of several variables. At the beginning of each section we state in
which chapters or sections the discussed topics are used. For more
background on these topics there exist numerous textbooks on linear algebra
and calculus. See, for instance, G. Strang, Linear Algebra and its
Applications (San Diego: Harcourt Brace Jovanovich, 1988); D. C. Lay, Linear
Algebra and its Applications (Reading: Addison-Wesley, 1997); and
J. R. Magnus and H. Neudecker, Matrix Differential Calculus with
Applications in Statistics and Econometrics (Chichester: Wiley, 1999).


A.1 Summations

Used  in  Chapters  1-7. 


Sum  notation 

Many computations in econometrics involve summations of large amounts of
numbers. For convenience of notation such summations are often denoted by
the summation symbol $\sum$. The sum of the $n$ numbers
$y_1, y_2, \ldots, y_n$ is denoted by

$$
\sum_{i=1}^{n} y_i = y_1 + y_2 + \cdots + y_n.
$$

Sometimes, if the value of $n$ is clear from the context, we write
$\sum y_i$ instead of $\sum_{i=1}^{n} y_i$. In a similar way,
$\sum_{i=1}^{n} y_i^2$ denotes the sum of the squared values
$y_1^2 + y_2^2 + \cdots + y_n^2$, and $\sum_{i=1}^{n} y_i x_i$ is the sum of
products $y_1 x_1 + y_2 x_2 + \cdots + y_n x_n$.


Properties  of  summations 

By writing out the involved summations one can verify that, for any
constants $a$ and $b$ that do not depend on $i$, there holds
$\sum_{i=1}^{n} a = na$, $\sum_{i=1}^{n} a y_i = a \sum_{i=1}^{n} y_i$, and
$\sum_{i=1}^{n} (a y_i + b x_i) = a \sum_{i=1}^{n} y_i + b \sum_{i=1}^{n} x_i$.
We often work with numbers in deviation from their (sample) mean denoted by
$\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$. The following properties are
useful:

$$
\sum_{i=1}^{n} (y_i - \bar{y}) = 0, \qquad
\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - n\bar{y}^2, \qquad
\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})
  = \sum_{i=1}^{n} y_i x_i - n\bar{y}\bar{x}.
$$


XMA01SIM 


Example A.1: Simulated Data on Student Learning

We consider (hypothetical) values of the scores of five students for their
Freshman Grade Point Average (FGPA) and for their SAT mathematics test
(SATM) and verbal test (SATV). The scores are in Panel 1 of Exhibit A.1, and
Panels 2 and 3 report the results of some operations on these numbers. (For
a real data set on student learning of 609 students, we refer to Example 1.1
(p. 12); here we restrict the attention to five students to get small
matrices that are convenient as an introduction.)

Panel 1: scores

STUDENT   FGPA   SATM   SATV
1         1.8    4      4
2         2.4    6      5
3         2.9    6      7
4         3.0    7      6
5         3.5    8      7

Panel 2: operations on FGPA scores

STUDENT   FGPA    FGPA_M   FGPA_S   FGPA_MS
1         1.8     -0.92    3.24     0.8464
2         2.4     -0.32    5.76     0.1024
3         2.9      0.18    8.41     0.0324
4         3.0      0.28    9.00     0.0784
5         3.5      0.78    12.25    0.6084
SUM       13.6     0       38.66    1.6680
MEAN      2.72     0       7.732    0.3336

Panel 3: operations on FGPA and SATM scores

STUDENT   FGPA    FGPA_M   SATM   SATM_M   FGPA*SATM   FGPA_M*SATM_M
1         1.8     -0.92    4.0    -2.2     7.2          2.024
2         2.4     -0.32    6.0    -0.2     14.4         0.064
3         2.9      0.18    6.0    -0.2     17.4        -0.036
4         3.0      0.28    7.0     0.8     21.0         0.224
5         3.5      0.78    8.0     1.8     28.0         1.404
SUM       13.6     0       31.0    0       88.0         3.680
MEAN      2.72     0       6.2     0       17.6         0.736

Exhibit A.1 Simulated Data on Student Learning (Example A.1)

Panel 1 contains scores on FGPA (on a scale from 1 to 4) and on SATM and
SATV (on a scale from 1 to 10) of five (hypothetical) students; Panel 2
shows the scores on FGPA, the scores in deviation from the mean (FGPA_M),
the squares of FGPA (FGPA_S), and the squares of FGPA in deviation from the
mean (FGPA_MS); and Panel 3 shows the scores on FGPA and SATM, the scores in
deviation from the mean (FGPA_M and SATM_M), the products of the FGPA and
SATM scores, and the products of the transformed values FGPA_M and SATM_M.

We denote the five students by the index $i = 1, \ldots, 5$, their FGPA by
$y_i$ and their SATM by $x_i$. For instance, $y_2 = 2.4$ and $x_4 = 7$. The
computations in Exhibit A.1 show that $\bar{y} = 2.72$, that
$\sum_{i=1}^{5} (y_i - \bar{y}) = 0$ and that
$\sum_{i=1}^{5} (y_i - \bar{y})^2 = 1.668$, which is equal to
$\sum_{i=1}^{5} y_i^2 - n\bar{y}^2 = 38.66 - 5(2.72)^2$. The computations in
Panel 3 of Exhibit A.1 further show that
$\sum_{i=1}^{5} (y_i - \bar{y})(x_i - \bar{x}) = 3.68$, which is equal to
$\sum_{i=1}^{5} y_i x_i - n\bar{y}\bar{x} = 88 - 5 \cdot 2.72 \cdot 6.2$.
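
The computations of this example can be checked with a few lines of code. The sketch below uses the FGPA and SATM scores of Exhibit A.1 and plain Python lists; the variable names are illustrative and not taken from the book. Each deviation-from-mean sum is computed twice, once directly and once through the shortcut formula, so that the two can be compared.

```python
import math

# Scores of the five students (Panel 1 of Exhibit A.1).
fgpa = [1.8, 2.4, 2.9, 3.0, 3.5]   # y_i
satm = [4, 6, 6, 7, 8]             # x_i
n = len(fgpa)

y_bar = sum(fgpa) / n
x_bar = sum(satm) / n

# Sum of deviations from the mean (zero up to rounding error).
sum_dev = sum(y - y_bar for y in fgpa)

# Sum of squared deviations: direct computation and shortcut formula.
ss_direct = sum((y - y_bar) ** 2 for y in fgpa)
ss_shortcut = sum(y * y for y in fgpa) - n * y_bar ** 2

# Sum of cross products of deviations: direct computation and shortcut formula.
cp_direct = sum((y - y_bar) * (x - x_bar) for y, x in zip(fgpa, satm))
cp_shortcut = sum(y * x for y, x in zip(fgpa, satm)) - n * y_bar * x_bar

print(round(y_bar, 2), round(ss_direct, 3), round(cp_direct, 2))
print(math.isclose(ss_direct, ss_shortcut), math.isclose(cp_direct, cp_shortcut))
```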


A.2  Vectors  and  matrices 

Used  in  Chapters  1,  3-7. 


Data  table 

In  econometrics  we  are  concerned  with  modelling  observed  data.  In  many  cases  the 
number  of  numerical  data  is  large  (several  hundreds  or  thousands  of  observations 
on  a  number  of  possibly  interesting  variables)  and  all  the  data  should  be  handled 
in  an  organized  way.  Many  data  sets  are  stored  in  a  spreadsheet  where  each 
column  corresponds  to  a  variable  and  the  length  of  the  column  is  equal  to  the 
number of observations of that variable. For instance, the data on FGPA,
SATM, and SATV of five students in the example in the foregoing section can
be represented by the following table.

Student   FGPA   SATM   SATV
1         1.8    4      4
2         2.4    6      5
3         2.9    6      7
4         3.0    7      6
5         3.5    8      7

Data  matrix 

The  real  data  information  consists  of  the  five  paired  scores  on  FGPA,  SATM,  and 
SATV  and  we  can  summarize  these  data  by  the  following  array  of  numbers: 

$$
\begin{pmatrix}
1.8 & 4 & 4 \\
2.4 & 6 & 5 \\
2.9 & 6 & 7 \\
3.0 & 7 & 6 \\
3.5 & 8 & 7
\end{pmatrix}.
$$

Such a rectangular block of numbers is called a matrix. The above matrix has
five rows and three columns. In econometrics we often work with matrices,
and of course we should always remind ourselves and other users what the
columns and rows mean (in this case, the correspondence between columns and
variables and between rows and students, so that the number 2.9 in column 1
and row 3 is known to correspond to the FGPA of the third student).

Matrix  notation 

More generally, let $A$ be a matrix with $p$ rows and $q$ columns; then we
say that $A$ is a $p \times q$ matrix and we denote the number in row $i$
and column $j$ by $a_{ij}$. The matrix is then of the form

$$
\begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1q} \\
a_{21} & a_{22} & \cdots & a_{2q} \\
\vdots & \vdots &        & \vdots \\
a_{p1} & a_{p2} & \cdots & a_{pq}
\end{pmatrix}.
$$

For instance, the matrix in the student example has $p = 5$ rows and $q = 3$
columns and $a_{12} = 4$ and $a_{21} = 2.4$, and so on.
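
As an illustration of this notation, the data matrix of the student example can be stored as a list of rows. This is only one possible representation, and the helper `element` is a hypothetical name introduced here; its purpose is to translate the 1-based indices $(i, j)$ used in the text to the 0-based indices that Python uses.

```python
# Data matrix of the student example: rows are students,
# columns are FGPA, SATM, and SATV.
A = [
    [1.8, 4, 4],
    [2.4, 6, 5],
    [2.9, 6, 7],
    [3.0, 7, 6],
    [3.5, 8, 7],
]


def element(matrix, i, j):
    """Return a_ij, the entry in row i and column j (both counted from 1)."""
    return matrix[i - 1][j - 1]


p, q = len(A), len(A[0])      # number of rows and number of columns
print(p, q)                   # dimensions of the matrix
print(element(A, 3, 1))       # FGPA of the third student
print(element(A, 2, 1))       # FGPA of the second student
```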

Matrix  notation  in  econometrics 

In this appendix we follow the convention of matrix algebra to denote the
element on row $i$ and column $j$ by $a_{ij}$. However, in econometrics we
often denote this element by $a_{j,i}$, that is, the first index refers to
the variable (that is, the column) and the second index to the observation
number (that is, the row), and for shorthand notation we then often even
write $a_{ji}$ for this number (see, for instance, Section 3.1.2 (p. 120)).
This may be somewhat confusing in the beginning, but in essence it does not
matter which convention we follow as long as we are clear what we mean by
our notation. Therefore, in this appendix we follow the convention of matrix
algebra books to make it easier to consult books that are specialized in
this topic, but in the main text we follow the convention of most
econometricians.

Special  matrices 

A  square  matrix  is  a  matrix  that  has  an  equal  number  of  rows  and  columns  —  that 
is,  with  p  =  q.  A  diagonal  matrix  is  a  square  matrix  with  the  property  that  ajj  =  0 
for  all  i  ^  j —  that  is,  all  elements  are  zero  except  possibly  the  elements  an  on  the 
diagonal  (with  equal  row  and  column  index).  A  special  case  of  a  diagonal  matrix  is 
the  identity  matrix  that  has  an  =  1  for  alii  =  1,  •  •  • ,  p,  and  a,j  =  0  for  all  i  ^  j.  The 
p  x  p  identity  matrix  is  denoted  by  Ip  or  simply  by  I.  A  zero  matrix  is  a  (square  or 
non-square)  matrix  with  all  its  elements  equal  to  zero  —  that  is,  a,j  =  0  for  all  i 
and  j.  The  p  x  q  zero  matrix  is  denoted  by  Opq  or  simply  by  O. 

A matrix with only one column, that is, a $p \times 1$ matrix, is called a
column vector, and a $1 \times q$ matrix is called a row vector. A column vector
is often simply called a vector. Whereas matrices are denoted by capital
letters ($A$, $B$, and so on), vectors are usually denoted by lower-case letters ($a$, $b$,
and so on).


A.3  Matrix  addition  and  multiplication 

Used  in  Chapters  1,  3-7. 


Matrix  addition 

This section and the next one contain a number of operations on matrices that are
much used in econometrics. Let $A$ be a $p \times q$ matrix with elements $a_{ij}$ and let $B$
also be a $p \times q$ matrix with elements $b_{ij}$. Then the sum $C = A + B$ of the two
matrices is defined as the $p \times q$ matrix with elements $c_{ij} = a_{ij} + b_{ij}$ in row $i$ and
column $j$ ($i = 1, \ldots, p$, $j = 1, \ldots, q$). Note that $A$ and $B$ should have the same
number of rows and also the same number of columns.

Matrix  multiplication 

Let $A$ be a $p \times q$ matrix with elements $a_{ij}$ ($i = 1, \ldots, p$, $j = 1, \ldots, q$) and let $B$ be a
$q \times r$ matrix with elements $b_{jk}$ ($j = 1, \ldots, q$, $k = 1, \ldots, r$). Then the product
$C = AB$ of the two matrices is defined as the $p \times r$ matrix with elements

$$c_{ik} = \sum_{j=1}^{q} a_{ij} b_{jk} = a_{i1} b_{1k} + a_{i2} b_{2k} + \cdots + a_{iq} b_{qk}$$

($i = 1, \ldots, p$, $k = 1, \ldots, r$). So the element in row $i$ and column $k$ of the product
matrix $AB$ is obtained by multiplying the $i$th row of $A$ element-wise with the $k$th
column of $B$ and summing the resulting products. This requires that the number of
columns of the matrix $A$ is equal to the number of rows of the matrix $B$, otherwise
the product $AB$ is not defined.

If the $p \times q$ matrix $A$ is multiplied with the $q \times 1$ vector $b$, then the product $Ab$ is
a $p \times 1$ vector. If in addition $A$ is a row vector $a$ (so that $p = 1$), then the product $ab$ of
a row vector with a column vector is a $1 \times 1$ matrix, that is, a scalar number.
Note that the product $ba$ of the $q \times 1$ column vector $b$ with the $1 \times q$ row vector $a$ is a
$q \times q$ matrix. Let $A$ be a $p \times q$ matrix with elements $a_{ij}$ ($i = 1, \ldots, p$, $j = 1, \ldots, q$) and
let $d$ be a real number; then the scalar multiple $dA$ is defined as the $p \times q$ matrix
with elements $d a_{ij}$ ($i = 1, \ldots, p$, $j = 1, \ldots, q$).
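
The definition of the product translates directly into a few lines of code. The
following is a minimal sketch in plain Python; the helper name `matmul` and the
small matrices are illustrative assumptions and are not taken from the student
example. It forms each element $c_{ik}$ as the sum over $j$ of $a_{ij} b_{jk}$ and shows the two
vector cases mentioned above.

```python
def matmul(A, B):
    """Product C = AB with c_ik = sum_j a_ij * b_jk (A is p x q, B is q x r)."""
    q = len(B)
    if any(len(row) != q for row in A):
        raise ValueError("number of columns of A must equal number of rows of B")
    r = len(B[0])
    return [[sum(row[j] * B[j][k] for j in range(q)) for k in range(r)] for row in A]

# Hypothetical matrices, chosen only to illustrate the dimensions.
A = [[1, 2, 3],
     [4, 5, 6]]          # 2 x 3
B = [[1, 0],
     [0, 1],
     [2, 2]]             # 3 x 2
C = matmul(A, B)         # 2 x 2 product AB

a = [[1, 2, 3]]          # 1 x 3 row vector
b = [[4], [5], [6]]      # 3 x 1 column vector
ab = matmul(a, b)        # 1 x 1 matrix, that is, a scalar
ba = matmul(b, a)        # 3 x 3 matrix
```

Storing a matrix as a list of rows keeps the sketch free of dependencies; in
practice a matrix software package would be used for this.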

Calculation  rules 

The following calculation rules hold true for matrix products and scalar multiples
(we use the notation $(p \times q)$ to denote the number of rows and columns of a
matrix).

Dimensions                                                  Rule
$A\,(p \times q)$, $B\,(q \times r)$, $C\,(q \times r)$     $A(B + C) = AB + AC$
$A\,(p \times q)$, $B\,(p \times q)$, $C\,(q \times r)$     $(A + B)C = AC + BC$
$A\,(p \times q)$, $B\,(q \times r)$, $C\,(r \times s)$     $A(BC) = (AB)C$
$A\,(p \times q)$                                           $I_p A = A I_q = A$
$A\,(p \times q)$                                           $O_{rp} A = O_{rq}$, $A O_{qr} = O_{pr}$
$A\,(p \times q)$, $B\,(p \times q)$, $d$ (scalar)          $d(A + B) = dA + dB$
$A\,(p \times q)$, $B\,(q \times r)$, $d$ (scalar)          $d(AB) = (dA)B = A(dB)$


Although some of the operations on matrices have the same properties as corresponding
operations on (scalar) numbers, this does not hold true for all operations.
In particular, if $A$ is a $p \times q$ matrix and $B$ a $q \times r$ matrix, then $AB$ (which is a
well-defined $p \times r$ matrix) is not the same as $BA$. The product $BA$ is not even defined if
$r \neq p$, and even if $r = p$ then $AB$ has $p$ rows and columns whereas $BA$ has $q$ rows
and columns, and even if in addition $p = q$ then still (except in special cases)

$$AB \neq BA.$$
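
A small numerical check makes the last point concrete. The sketch below uses two
hypothetical $2 \times 2$ matrices (they are assumptions for illustration only) and
shows that the two products differ even though both are defined and have the same
size.

```python
def matmul(A, B):
    """Matrix product: rows of A combined with columns of B."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# Hypothetical square matrices with p = q = r = 2.
A = [[1, 2],
     [3, 4]]
B = [[0, 1],
     [1, 0]]

AB = matmul(A, B)   # post-multiplying by B swaps the columns of A
BA = matmul(B, A)   # pre-multiplying by B swaps the rows of A
```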




Example A.2: Simulated Data on Student Learning (continued)

We illustrate the use of matrices and vectors as an efficient tool for data organization
by considering again the data on scores of five students. Let $y_i$ denote the
FGPA score of student $i$ and let $x_i$ be the SATM score and $z_i$ the SATV score of this
student. As a simple model for the explanation of FGPA in terms of SATM and
SATV we consider the linear relationship

$$y_i = b_1 + b_2 x_i + b_3 z_i, \quad i = 1, \ldots, 5.$$

If we substitute the numbers for $y_i$, $x_i$, and $z_i$ given in Exhibit A.1 and collect the
results for the five students in a $5 \times 1$ vector we obtain


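$$
\begin{pmatrix} 1.8 \\ 2.4 \\ 2.9 \\ 3.0 \\ 3.5 \end{pmatrix}
=
\begin{pmatrix}
b_1 + 4b_2 + 4b_3 \\
b_1 + 6b_2 + 5b_3 \\
b_1 + 6b_2 + 7b_3 \\
b_1 + 7b_2 + 6b_3 \\
b_1 + 8b_2 + 7b_3
\end{pmatrix}
=
\begin{pmatrix}
1 & 4 & 4 \\
1 & 6 & 5 \\
1 & 6 & 7 \\
1 & 7 & 6 \\
1 & 8 & 7
\end{pmatrix}
\begin{pmatrix} b_1 \\ b_2 \\ b_3 \end{pmatrix}.
$$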


Let $y$ denote the above $5 \times 1$ vector of FGPA scores, let $b$ be the $3 \times 1$ vector with
elements $b_1$, $b_2$, $b_3$, and let $X$ be the above $5 \times 3$ matrix with the first column
consisting of ones, the second of the SATM scores, and the third of the SATV scores.
Then the above model can be written in terms of the given data vector $y$ and the
given data matrix $X$ as $y = Xb$, where

$$
X = \begin{pmatrix}
1 & 4 & 4 \\
1 & 6 & 5 \\
1 & 6 & 7 \\
1 & 7 & 6 \\
1 & 8 & 7
\end{pmatrix}.
$$

This is a system of five equations (one for each student) and three unknowns (the
values of $b_1$, $b_2$, $b_3$). There does not exist a solution for the three unknowns so
that all five equations are exactly satisfied. Approximate solutions can be obtained
by least squares, that is, by minimizing $\sum_{i=1}^{5} (y_i - b_1 - b_2 x_i - b_3 z_i)^2$. Let
$e_i = y_i - b_1 - b_2 x_i - b_3 z_i$ be the error of the equation for student $i$; then this
corresponds to minimizing $\sum_{i=1}^{5} e_i^2$ by choosing appropriate values for
$b_1$, $b_2$, $b_3$. We will later come back to this (see Example A.11).
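
To see what the least squares criterion measures, it helps to evaluate it for a few
trial values of $b$. The following sketch in plain Python uses the data of the five
students as given in Example A.2; the function names and the two trial vectors are
assumptions made for illustration, and neither trial vector is claimed to be the
least squares solution.

```python
# Data of the five students as used in Example A.2 (y: FGPA, rows of X: 1, SATM, SATV).
y = [1.8, 2.4, 2.9, 3.0, 3.5]
X = [[1, 4, 4], [1, 6, 5], [1, 6, 7], [1, 7, 6], [1, 8, 7]]

def errors(b):
    """Errors e_i = y_i - b_1 - b_2*x_i - b_3*z_i, the elements of y - Xb."""
    return [yi - sum(xij * bj for xij, bj in zip(row, b)) for yi, row in zip(y, X)]

def sum_of_squares(b):
    """Least squares criterion: the sum of the squared errors."""
    return sum(e * e for e in errors(b))

# Two hypothetical trial values for (b_1, b_2, b_3).
s_zero = sum_of_squares([0.0, 0.0, 0.0])    # reduces to the sum of the y_i squared
s_trial = sum_of_squares([0.0, 0.5, 0.0])   # FGPA fitted as half the SATM score
```

The second trial vector gives a much smaller criterion value than the first, which
is the sense in which it is a better approximate solution of the five equations.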


A.4  Transpose,  trace,  and  inverse 

Used  in  Chapters  1,  3-7. 


Transpose  of  a  matrix 

Let $A$ be a given $p \times q$ matrix with elements $a_{ij}$, $i = 1, \ldots, p$, $j = 1, \ldots, q$. Then
the transpose of $A$, denoted by $A'$, is the $q \times p$ matrix with the value of $a_{ij}$ placed in
row $j$ and column $i$. For instance, the transpose of the $5 \times 3$ matrix $X$ in Example
A.2 is the $3 \times 5$ matrix

$$
X' = \begin{pmatrix}
1 & 1 & 1 & 1 & 1 \\
4 & 6 & 6 & 7 & 8 \\
4 & 5 & 7 & 6 & 7
\end{pmatrix}.
$$

A square $p \times p$ matrix $A$ is called symmetric if $A' = A$, that is, if $a_{ij} = a_{ji}$ for
all $i = 1, \ldots, p$, $j = 1, \ldots, p$. Some calculation rules for transposed matrices are

$$(A')' = A, \quad (A + B)' = A' + B', \quad (AB)' = B'A'.$$

If $a$ is a $p \times 1$ vector, then $a'a = \sum_{i=1}^{p} a_i^2$ is a scalar number equal to the sum of the
squares of the elements $a_i$ of the vector $a$.
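
The rules for the transpose can be checked numerically. The sketch below, in plain
Python, transposes the matrix $X$ of Example A.2 and verifies $(AB)' = B'A'$ and the
sum-of-squares property of $a'a$; the matrix `B` and the vector `a` are hypothetical
and serve only as an illustration.

```python
def transpose(A):
    """Transpose: the element in row i, column j moves to row j, column i."""
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    """Matrix product: rows of A combined with columns of B."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

X = [[1, 4, 4], [1, 6, 5], [1, 6, 7], [1, 7, 6], [1, 8, 7]]   # 5 x 3, from Example A.2
Xt = transpose(X)                                             # 3 x 5

B = [[1, 2], [0, 1], [3, 0]]            # hypothetical 3 x 2 matrix
lhs = transpose(matmul(X, B))           # (XB)'
rhs = matmul(transpose(B), Xt)          # B'X'

a = [[1], [2], [3]]                     # hypothetical 3 x 1 vector
ata = matmul(transpose(a), a)           # 1 x 1 matrix holding 1 + 4 + 9
```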

Partitioned  matrix 

In some cases it is convenient to work with partitioned matrices, that is, matrices
that are split up in parts. For instance, let $A$ be a $p \times q$ matrix and let $A_1$ be the
$p \times s$ submatrix consisting of the first $s$ columns of $A$ and $A_2$ the $p \times (q - s)$
submatrix consisting of the remaining $(q - s)$ columns of $A$; then we write
$A = (A_1 \;\; A_2)$. Let the rows of the $q \times r$ matrix $B$ be split in a similar way into the
$s \times r$ submatrix $B_1$ consisting of the first $s$ rows of $B$ and the $(q - s) \times r$ submatrix
$B_2$ consisting of the remaining rows of $B$, so that
$B = \begin{pmatrix} B_1 \\ B_2 \end{pmatrix}$. Then
$AB = A_1 B_1 + A_2 B_2$. More in general, let the $j$th column of the $p \times q$ matrix $A$
be denoted by $a_j$ (a $p \times 1$ vector) and let the $j$th row of the $q \times r$ matrix $B$ be
denoted by $b_j'$ (a $1 \times r$ row vector), $j = 1, \ldots, q$; then $AB = \sum_{j=1}^{q} a_j b_j'$.

Trace  of  a  square  matrix 

The trace of a square $p \times p$ matrix $A$, denoted by $\operatorname{tr}(A)$, is defined as the sum of its
diagonal elements, that is,

$$\operatorname{tr}(A) = \sum_{i=1}^{p} a_{ii}.$$

The  following  calculation  rules  hold  true  (assuming  that  the  shown  matrix  sums 
and  products  are  square  matrices): 

tr(A  +  B)  =  tr(A)  +  tr(B),  tr(AB)  =  tr(BA),  tr(A')  =  tr(A). 
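
The rule $\operatorname{tr}(AB) = \operatorname{tr}(BA)$ holds even when $AB$ and $BA$ have different sizes, as long as
both products are square. The following sketch illustrates this with a hypothetical
$2 \times 3$ matrix `A` and $3 \times 2$ matrix `B` (both are assumptions chosen for
illustration).

```python
def matmul(A, B):
    """Matrix product: rows of A combined with columns of B."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def trace(A):
    """Sum of the diagonal elements of a square matrix."""
    if any(len(row) != len(A) for row in A):
        raise ValueError("the trace is only defined for square matrices")
    return sum(A[i][i] for i in range(len(A)))

A = [[1, 2, 3],
     [4, 5, 6]]      # hypothetical 2 x 3 matrix
B = [[1, 0],
     [0, 1],
     [2, 2]]         # hypothetical 3 x 2 matrix

tr_AB = trace(matmul(A, B))   # trace of a 2 x 2 matrix
tr_BA = trace(matmul(B, A))   # trace of a 3 x 3 matrix
```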


Inverse  of  a  square  matrix 

A square $p \times p$ matrix $A$ is called invertible if there exists a $p \times p$ matrix $B$ such
that $AB = BA = I_p$, the $p \times p$ identity matrix. Such a matrix $B$ is called the inverse
matrix of $A$ and is denoted by $A^{-1}$.

If $A$ is a given invertible $p \times p$ matrix and $b$ is a given $p \times 1$ vector, then
there exists a unique $p \times 1$ vector $c$ such that $Ac = b$, namely, $c = A^{-1}b$. The
following computation rules apply, where $A$ and $B$ are square $p \times p$ invertible
matrices:

$$(AB)^{-1} = B^{-1}A^{-1}, \quad (A')^{-1} = (A^{-1})', \quad (A^{-1})^{-1} = A.$$

Not every square matrix is invertible: for instance, if $A = O$ is a square zero
matrix then it has no inverse. A square matrix $A$ is invertible if and only if its
determinant is non-zero, as is further discussed in the next section.




Example  A.3:  Simulated  Data  on  Student  Learning  (continued) 

In Example A.2 we mentioned that, for given observations collected in the $5 \times 1$
vector $y$ and the $5 \times 3$ matrix $X$, least squares corresponds to choosing values for
$b_1$, $b_2$, $b_3$ such that the sum of squares of the errors $e_i = y_i - b_1 - b_2 x_i - b_3 z_i$ is as
small as possible. Let $e$ be the $5 \times 1$ vector with elements $e_i$; then we can write
$e = y - Xb$ and

$$\sum_{i=1}^{5} e_i^2 = e'e = (y - Xb)'(y - Xb).$$

This expression can be worked out by using $(y - Xb)' = y' - (Xb)' = y' - b'X'$,
so that $(y - Xb)'(y - Xb) = y'(y - Xb) - b'X'(y - Xb) = y'y - y'Xb - b'X'y + b'X'Xb$.
As $y$ is a $5 \times 1$ vector, $X$ a $5 \times 3$ matrix, and $b$ a $3 \times 1$ vector, it follows
that $y'Xb$ is a scalar ($1 \times 1$ matrix), which is of course symmetric, so that
$y'Xb = (y'Xb)' = b'(y'X)' = b'X'y$. Combining these results, the sum of squared
errors becomes

$$(y - Xb)'(y - Xb) = y'y - 2b'X'y + b'X'Xb.$$


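We can use the numerical values in Exhibit A.1 to compute $y'y = \sum_{i=1}^{5} y_i^2 = 38.66$
and

$$
X'y = \begin{pmatrix}
1 & 1 & 1 & 1 & 1 \\
4 & 6 & 6 & 7 & 8 \\
4 & 5 & 7 & 6 & 7
\end{pmatrix}
\begin{pmatrix} 1.8 \\ 2.4 \\ 2.9 \\ 3.0 \\ 3.5 \end{pmatrix}
= \begin{pmatrix} 13.6 \\ 88 \\ 82 \end{pmatrix},
$$

$$
X'X = \begin{pmatrix}
1 & 1 & 1 & 1 & 1 \\
4 & 6 & 6 & 7 & 8 \\
4 & 5 & 7 & 6 & 7
\end{pmatrix}
\begin{pmatrix}
1 & 4 & 4 \\
1 & 6 & 5 \\
1 & 6 & 7 \\
1 & 7 & 6 \\
1 & 8 & 7
\end{pmatrix}
= \begin{pmatrix}
5 & 31 & 29 \\
31 & 201 & 186 \\
29 & 186 & 175
\end{pmatrix}.
\quad \text{(A.1)}
$$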


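The matrix $X'X$ is symmetric, which also follows from $(X'X)' = X'(X')' = X'X$.
The above results lead to the following expression for the sum of squared errors,
which will be of later use:

$$
\begin{aligned}
(y - Xb)'(y - Xb) &= 38.66 - 2\,(b_1 \;\; b_2 \;\; b_3)\,X'y + (b_1 \;\; b_2 \;\; b_3)\,X'X \begin{pmatrix} b_1 \\ b_2 \\ b_3 \end{pmatrix} \\
&= 38.66 - 27.2b_1 - 176b_2 - 164b_3 + 5b_1^2 + 201b_2^2 \\
&\quad + 175b_3^2 + 62b_1b_2 + 58b_1b_3 + 372b_2b_3.
\end{aligned}
\quad \text{(A.2)}
$$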


The  computation  of  the  inverse  of  X'X  will  be  considered  in  the  next  section  (see 
Example  A.5). 
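
The quantities $y'y$, $X'y$, and $X'X$ of this example are easily reproduced by computer.
The following sketch in plain Python uses the data of Example A.2; the function
names and the trial vector `b` are assumptions made for illustration. It also
checks that the expanded form $y'y - 2b'X'y + b'X'Xb$ agrees with the direct sum of
squared errors.

```python
y = [1.8, 2.4, 2.9, 3.0, 3.5]
X = [[1, 4, 4], [1, 6, 5], [1, 6, 7], [1, 7, 6], [1, 8, 7]]

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

Xt = transpose(X)
XtX = matmul(Xt, X)                                        # 3 x 3 matrix of (A.1)
Xty = [sum(x * yi for x, yi in zip(row, y)) for row in Xt]  # 3 x 1 vector X'y
yty = sum(yi * yi for yi in y)                              # scalar y'y

def expanded(b):
    """Sum of squared errors written as y'y - 2 b'X'y + b'X'Xb."""
    linear = sum(bi * v for bi, v in zip(b, Xty))
    quadratic = sum(b[i] * XtX[i][j] * b[j] for i in range(3) for j in range(3))
    return yty - 2 * linear + quadratic

def direct(b):
    """Sum of squared errors computed from e = y - Xb."""
    return sum((yi - sum(x * bj for x, bj in zip(row, b))) ** 2 for yi, row in zip(y, X))

b = [0.5, 0.2, 0.1]   # hypothetical trial value, not the least squares solution
```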


A.5  Determinant,  rank,  and  eigenvalues 

Used  in  Sections  1.2,  5.7,  7.6,  7.7. 

As  the  results  of  this  section  are  mostly  of  a  computational  nature,  readers  can 
skip  the  details  without  much  cost  and  leave  the  actual  computation  of  inverses, 
determinants,  and  eigenvalues  to  (matrix)  software  packages.  Some  of  the  details 
are needed in Section 7.6.

Determinant of a square matrix

A square $p \times p$ matrix $A$ is invertible if and only if its determinant is non-zero.
Here the determinant, denoted by $\det(A)$, is a scalar number that can be computed
from the elements $a_{ij}$ ($i = 1, \ldots, p$, $j = 1, \ldots, p$). For a scalar ($1 \times 1$) matrix
$A = (a_{11})$ the determinant is simply defined by $\det(A) = a_{11}$. For a $2 \times 2$ matrix
the determinant is defined by

$$\det \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} = a_{11}a_{22} - a_{12}a_{21},$$

and for a $3 \times 3$ matrix the determinant is defined by

$$
\begin{aligned}
\det \begin{pmatrix}
a_{11} & a_{12} & a_{13} \\
a_{21} & a_{22} & a_{23} \\
a_{31} & a_{32} & a_{33}
\end{pmatrix}
&= a_{11}a_{22}a_{33} + a_{12}a_{23}a_{31} + a_{13}a_{21}a_{32} \\
&\quad - a_{11}a_{23}a_{32} - a_{12}a_{21}a_{33} - a_{13}a_{22}a_{31}.
\end{aligned}
$$




Example A.4: Simulated Data on Student Learning (continued)

The determinant of the $3 \times 3$ matrix $X'X$ in (A.1) is equal to $5(201)(175) +
31(186)(29) + 29(31)(186) - 5(186)(186) - 31(31)(175) - 29(201)(29) = 107$.
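
The six-term rule for a $3 \times 3$ determinant can be written out directly. The
following sketch (the function name `det3` is an assumption for illustration)
applies it to the matrix $X'X$ of (A.1).

```python
def det3(A):
    """Determinant of a 3 x 3 matrix by the six-term rule given above."""
    (a11, a12, a13), (a21, a22, a23), (a31, a32, a33) = A
    return (a11 * a22 * a33 + a12 * a23 * a31 + a13 * a21 * a32
            - a11 * a23 * a32 - a12 * a21 * a33 - a13 * a22 * a31)

XtX = [[5, 31, 29],
       [31, 201, 186],
       [29, 186, 175]]   # the matrix X'X in (A.1)

d = det3(XtX)
```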


Computation  of  determinant 

The determinant of a $p \times p$ matrix $A$ can be computed from the determinants of
smaller-sized submatrices by expansion according to any of the rows or columns of
$A$. Let $A_{ij}$ be the $(p - 1) \times (p - 1)$ matrix obtained by deleting the $i$th row and the
$j$th column of $A$ and let the cofactor $C_{ij}$ be defined by $C_{ij} = (-1)^{i+j}\det(A_{ij})$; then

$$\det(A) = \sum_{j=1}^{p} a_{ij} C_{ij} = \sum_{i=1}^{p} a_{ij} C_{ij}.$$

The first expression is valid for any choice of $i$ and corresponds to an expansion
according to the $i$th row of $A$, and the second expression is valid for any choice of $j$
and corresponds to an expansion according to the $j$th column of $A$. Whatever
choice we make for $i$ or $j$, the above expansions always lead to the same numerical
outcome. For example, the determinant of a $4 \times 4$ matrix $A$ can be obtained by
expansion according to the first row as $a_{11}C_{11} + a_{12}C_{12} + a_{13}C_{13} + a_{14}C_{14}$, where
each of the four cofactors $C_{1j}$, $j = 1, \ldots, 4$, involves the determinant of a $3 \times 3$
submatrix of $A$ that can be computed as indicated above.
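
The expansion lends itself to a recursive computation. The following sketch in
plain Python expands along a chosen row; the function names and the $4 \times 4$ test
matrix are assumptions for illustration. Row indices are zero-based in the code,
which leaves the sign $(-1)^{i+j}$ unchanged. This approach is meant to mirror the
definition and is far too slow for large matrices.

```python
def minor(A, i, j):
    """Submatrix A_ij: delete row i and column j (zero-based indices)."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(A) if k != i]

def det(A, i=0):
    """Determinant by cofactor expansion along row i; minors use their first row."""
    p = len(A)
    if p == 0:
        return 1          # convention for the empty matrix, so that det((a)) = a
    return sum(A[i][j] * (-1) ** (i + j) * det(minor(A, i, j)) for j in range(p))

XtX = [[5, 31, 29], [31, 201, 186], [29, 186, 175]]   # the matrix X'X in (A.1)
by_row = [det(XtX, i) for i in range(3)]              # expansion along each row

M = [[1, 2, 0, 1],
     [0, 1, 3, 0],
     [2, 0, 1, 1],
     [1, 1, 0, 2]]                                    # hypothetical 4 x 4 matrix
by_row_M = [det(M, i) for i in range(4)]
```

Whatever row is chosen, the list entries coincide, in line with the statement
above.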

Computation  of  inverse  matrix 

A matrix is invertible if and only if its determinant is non-zero, and the $(i, j)$th
element of the inverse matrix can be computed by

$$C_{ji} / \det(A),$$

that is, the element on row $i$ and column $j$ of $A^{-1}$ is equal to the cofactor $C_{ji}$
divided by the determinant of the full $p \times p$ matrix $A$. For instance, the inverse
of a $2 \times 2$ matrix $A$ can be computed as follows. The cofactors are equal to
$C_{11} = a_{22}$, $C_{12} = -a_{21}$, $C_{21} = -a_{12}$, and $C_{22} = a_{11}$, so that

$$
\begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}^{-1}
= \frac{1}{a_{11}a_{22} - a_{12}a_{21}}
\begin{pmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{pmatrix}.
$$

A direct computation shows that indeed $AA^{-1} = A^{-1}A = I_2$. Let $C$ be the $p \times p$
matrix with $(i, j)$th element $C_{ji}$; then $CA = AC = \det(A)I$, that is, the $p \times p$ diagonal
matrix with the value $\det(A)$ on the diagonal.

The following computation rules apply for determinants, where $A$ and $B$ are
square $p \times p$ matrices:

$$\det(A') = \det(A), \quad \det(AB) = \det(A)\det(B), \quad \det(A^{-1}) = 1/\det(A).$$


Example  A.5:  Simulated  Data  on  Student  Learning  (continued) 

As the determinant of the matrix $X'X$ is non-zero (see Example A.4), it has an
inverse. Let the cofactors of $X'X$, each obtained from a $2 \times 2$ determinant, be denoted
by $C_{ij}$ ($i = 1, 2, 3$, $j = 1, 2, 3$); then the $(i, j)$th element of $(X'X)^{-1}$ is equal to
$C_{ji}/\det(X'X) = C_{ji}/107$. The nine cofactors are easily obtained from (A.1): for
instance, $C_{11} = (-1)^2(201 \cdot 175 - 186^2) = 579$, $C_{12} = (-1)^3(31 \cdot 175 - 186 \cdot 29) = -31$,
and so on. This gives

$$
(X'X)^{-1} = \frac{1}{107}
\begin{pmatrix}
579 & -31 & -63 \\
-31 & 34 & -31 \\
-63 & -31 & 44
\end{pmatrix}.
\quad \text{(A.3)}
$$

From working out the matrix products it follows that $(X'X)(X'X)^{-1} =
(X'X)^{-1}(X'X) = I_3$. Note that $X'X$ is a symmetric matrix and that $(X'X)^{-1}$ is
also symmetric.
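
The cofactor formula for the inverse can be carried out exactly with rational
arithmetic. The sketch below uses Python's standard `fractions` module; the
function names are assumptions for illustration. It reproduces the matrix in (A.3)
and verifies that the product with $X'X$ is the identity matrix.

```python
from fractions import Fraction

def minor(A, i, j):
    """Delete row i and column j (zero-based indices)."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(A) if k != i]

def det(A):
    """Determinant by cofactor expansion along the first row."""
    if len(A) == 0:
        return 1
    return sum(A[0][j] * (-1) ** j * det(minor(A, 0, j)) for j in range(len(A)))

def inverse(A):
    """Element (i, j) of the inverse is the cofactor C_ji divided by det(A)."""
    d = det(A)
    if d == 0:
        raise ValueError("matrix is singular")
    p = len(A)
    return [[Fraction((-1) ** (i + j) * det(minor(A, j, i)), d) for j in range(p)]
            for i in range(p)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

XtX = [[5, 31, 29], [31, 201, 186], [29, 186, 175]]   # the matrix X'X in (A.1)
XtX_inv = inverse(XtX)
scaled = [[int(v * 107) for v in row] for row in XtX_inv]   # entries times 107
identity = matmul(XtX, XtX_inv)
```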

Rank  of  a  matrix 

If the determinant of a square matrix $A$ is zero, then there does not exist an inverse
matrix of $A$. Such a matrix is called non-invertible or singular.

The rank of a $p \times q$ matrix $A$ is equal to the largest number $r$ for which there
exists a square submatrix of $A$ of size $r \times r$ that has a non-zero determinant. Here
square submatrices of size $r \times r$ are obtained by choosing the elements on the cross
points of $r$ chosen rows of the $p$ rows of $A$ and $r$ chosen columns of the $q$ columns
of $A$. The rank of a $p \times q$ matrix $A$ satisfies $\operatorname{rank}(A) \leq \min(p, q)$ and the rank of
the product of two matrices satisfies $\operatorname{rank}(AB) \leq \min(\operatorname{rank}(A), \operatorname{rank}(B))$.

If a $p \times q$ matrix $A$ has rank $r < q$, then there exists a non-zero $q \times 1$ vector $b$
such that $Ab = 0$. One then says that the $q$ columns of $A$ are linearly dependent.
If a square $p \times p$ matrix $A$ has $\det(A) = 0$, then $\operatorname{rank}(A) < p$ and there exists a
non-zero $p \times 1$ vector $b$ such that $Ab = 0$. If $A$ is invertible, then the only vector $b$
for which $Ab = 0$ is $b = 0$. If a $p \times p$ matrix $A$ has rank $r < p$, then there exist
$p \times r$ matrices $B$ and $C$ with $\operatorname{rank}(B) = \operatorname{rank}(C) = r$ such that $A = BC'$.

Example A.6: Simulated Data on Student Learning (continued)

Consider the $3 \times 5$ matrix

$$
X' = \begin{pmatrix}
1 & 1 & 1 & 1 & 1 \\
4 & 6 & 6 & 7 & 8 \\
4 & 5 & 7 & 6 & 7
\end{pmatrix}.
$$

The $3 \times 3$ submatrix consisting of columns 2, 4, and 5 has determinant
$(49 + 40 + 36 - 48 - 42 - 35) = 0$, but the submatrix consisting of the first
three columns has determinant $(42 + 24 + 20 - 30 - 28 - 24) = 4 \neq 0$. This
shows that $\operatorname{rank}(X') = 3$. As $X'$ has five columns, there exists a non-zero $5 \times 1$
vector $b$ so that $X'b = 0$. For instance, $b = (-2, 5, 1, -4, 0)'$ is such a vector. The
five columns $c_i$ ($3 \times 1$ vectors, $i = 1, \ldots, 5$) of $X'$ are linearly dependent: for
instance, $c_3 = 2c_1 - 5c_2 + 4c_4$.
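
The claims of this example can be verified with a few lines of code. The following
sketch (function names are assumptions for illustration) computes the two $3 \times 3$
determinants and checks that $X'b = 0$ for the vector $b$ given above.

```python
Xt = [[1, 1, 1, 1, 1],
      [4, 6, 6, 7, 8],
      [4, 5, 7, 6, 7]]   # the 3 x 5 matrix X'

def det3(A):
    """Determinant of a 3 x 3 matrix by the six-term rule."""
    (a11, a12, a13), (a21, a22, a23), (a31, a32, a33) = A
    return (a11 * a22 * a33 + a12 * a23 * a31 + a13 * a21 * a32
            - a11 * a23 * a32 - a12 * a21 * a33 - a13 * a22 * a31)

def columns(A, idx):
    """Submatrix of A consisting of the columns listed in idx (zero-based)."""
    return [[row[j] for j in idx] for row in A]

d_245 = det3(columns(Xt, [1, 3, 4]))   # columns 2, 4, and 5
d_123 = det3(columns(Xt, [0, 1, 2]))   # first three columns

b = [-2, 5, 1, -4, 0]
Xtb = [sum(x * bj for x, bj in zip(row, b)) for row in Xt]   # should be the zero vector
```

A single non-zero $3 \times 3$ determinant suffices to conclude that the rank is three,
since the rank cannot exceed the number of rows.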

Eigenvalues  and  eigenvectors 

A (possibly complex) number $\lambda$ is called an eigenvalue of the square $p \times p$ matrix
$A$ if $\det(A - \lambda I_p) = 0$. When seen as a function of the (complex) variable
$z$, $f(z) = \det(A - zI_p)$ is a polynomial in $z$ of order $p$. This is called the characteristic
polynomial of $A$ and it has $p$ (possibly non-distinct) roots $\lambda_i$, $i = 1, \ldots, p$, that
is, values for which $f(\lambda_i) = 0$. These roots are the eigenvalues of $A$, and for each
root $\lambda_i$ there exists a (possibly complex-valued) non-zero $p \times 1$ vector $v_i$ such that
$(A - \lambda_i I_p)v_i = 0$, that is, such that $Av_i = \lambda_i v_i$. So if the vector $v_i$ is multiplied by
the matrix $A$, then the resulting vector is a multiple of $v_i$. Such a vector is called an
eigenvector of the matrix $A$ corresponding to the eigenvalue $\lambda_i$.

Let the square $p \times p$ matrix $A$ have eigenvalues $\lambda_1, \ldots, \lambda_p$; then the determinant
of $A$ is equal to the product $\prod_{i=1}^{p} \lambda_i$ and the trace of $A$ is equal to the sum $\sum_{i=1}^{p} \lambda_i$.
Further, let $A^k$ be the matrix product $A \times A \times \cdots \times A$ (with $k$ terms $A$); then
$A^k \to 0$ for $k \to \infty$ if and only if all eigenvalues $\lambda_i$ lie inside the unit circle in the
complex plane.

Eigenvalue  decomposition  of  a  symmetric  matrix 

If the matrix $A$ (with real-valued elements $a_{ij}$ for all $i$, $j$) is symmetric, then all its
eigenvalues are real-valued. Moreover there exist $p$ real-valued eigenvectors $v_i$ such that
$Av_i = \lambda_i v_i$ with the properties that $v_i'v_i = 1$ for all $i = 1, \ldots, p$, and $v_i'v_j = 0$ for
all $i \neq j$. Let $V$ be the $p \times p$ matrix with $i$th column $v_i$; then it follows that $V'V = I$
so that $V^{-1} = V'$. Such a matrix is called orthogonal. Further let $D$ be the $p \times p$
diagonal matrix with the values $\lambda_1, \ldots, \lambda_p$ on the diagonal; then there holds
$AV = A(v_1 \cdots v_p) = (Av_1 \cdots Av_p) = (\lambda_1 v_1 \cdots \lambda_p v_p) = (v_1 \cdots v_p)D = VD$ and hence
$A = VDV^{-1} = VDV'$. Summarizing, every symmetric $p \times p$ matrix $A$ (with
real-valued elements $a_{ij}$) can be written as

$$A = VDV', \quad \text{(A.4)}$$

where $D$ is a diagonal matrix and $V$ an orthogonal matrix with the property that
$V'V = VV' = I_p$.


Example A.7: Simulated Data on Student Learning (continued)

We consider the symmetric matrix $X'X$ in (A.1). If the calculation rule for the
determinant of the $3 \times 3$ matrix $X'X - zI_3$ is applied, the characteristic polynomial
is obtained as

$$f(z) = \det(X'X - zI_3) = -z^3 + 381z^2 - 657z + 107.$$

Exhibit A.2 shows the values of this polynomial for real values of $z$. It is seen
that this polynomial has three positive roots. So the three eigenvalues are real
and positive, with (rounded) values $\lambda_1 = 0.18208$, $\lambda_2 = 1.54946$, and
$\lambda_3 = 379.26846$.

Exhibit A.2 Simulated Data on Student Learning (Example A.7)

Characteristic polynomial $f(z)$ of the $3 \times 3$ matrix $X'X$ (a) with details on two subintervals,
one near the roots 0.18 and 1.55 (b) and the other near the root 379.27 (c).

Corresponding eigenvectors $v_1$, $v_2$, $v_3$ are given by the columns of the
following matrix $V$ (again only the rounded values are given; more precise values
can be obtained by matrix software packages):

$$
V = (v_1 \;\; v_2 \;\; v_3) = \begin{pmatrix}
0.99246 & 0.04813 & 0.11270 \\
-0.04929 & -0.68522 & 0.72667 \\
-0.11220 & 0.72675 & 0.67768
\end{pmatrix}.
$$

By computing the respective matrix products one can directly verify that (up
to rounding errors) $X'Xv_i = \lambda_i v_i$ ($i = 1, 2, 3$) and that $V'V = VV' = I_3$ and
$VDV' = X'X$, where $D$ is the diagonal matrix with the elements $\lambda_1, \lambda_2, \lambda_3$ on the
diagonal.
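
The three eigenvalues can be recovered from the characteristic polynomial by a
simple root search. The sketch below uses bisection in plain Python; the bracketing
intervals $(0, 1)$, $(1, 2)$, and $(379, 380)$ are assumptions read off from the sign
changes of $f$, and bisection is only one of several possible methods (matrix
software packages use more refined algorithms).

```python
def f(z):
    """Characteristic polynomial of X'X from Example A.7."""
    return -z**3 + 381 * z**2 - 657 * z + 107

def bisect(lo, hi, tol=1e-10):
    """Root of f in [lo, hi] by repeated halving; requires a sign change."""
    f_lo = f(lo)
    if f_lo * f(hi) > 0:
        raise ValueError("f has the same sign at both end points")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if f_lo * f_mid <= 0:
            hi = mid                 # the root lies in the left half
        else:
            lo, f_lo = mid, f_mid    # the root lies in the right half
    return 0.5 * (lo + hi)

roots = [bisect(0, 1), bisect(1, 2), bisect(379, 380)]
total = sum(roots)                        # should equal the trace of X'X, 381
product = roots[0] * roots[1] * roots[2]  # should equal the determinant, 107
```

The sum and product of the roots provide a check against the trace and determinant
rules stated earlier for eigenvalues.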


A.6  Positive (semi)definite matrices and projections

Used  in  Chapters  1,  3,  5,  7. 


Positive  (semi)definite  matrix 

Let $A$ be a square $p \times p$ symmetric matrix; then for every $p \times 1$ vector $b$ the
product $b'Ab$ is a $1 \times 1$ matrix, that is, this product is a scalar number. A
symmetric matrix $A$ is called positive definite if $b'Ab > 0$ for every non-zero
vector $b$. It is called positive semidefinite if $b'Ab \geq 0$ for all vectors $b$. It is called
negative definite if $b'Ab < 0$ for every non-zero vector $b$ and it is called negative
semidefinite if $b'Ab \leq 0$ for all vectors $b$. Let $A$ be a $p \times q$ matrix; then $AA'$ and
$A'A$ are positive semidefinite matrices. For instance, for every $q \times 1$ vector $b$ there
holds $b'A'Ab = (Ab)'(Ab) = c'c = \sum_{j=1}^{p} c_j^2 \geq 0$, where $c_j$ ($j = 1, \ldots, p$) are the
elements of the $p \times 1$ vector $c = Ab$. If $\operatorname{rank}(A) = q$, then the $q \times q$ matrix $A'A$
has rank $q$ and it is positive definite.

Square  root  of  a  positive  definite  matrix 

If the symmetric matrix $A$ is positive definite, then it has an inverse $A^{-1}$ and this
matrix is also positive definite. Further, if $A$ is a $p \times p$ symmetric positive definite
matrix, then there exist a $p \times p$ matrix $B$ and $p \times p$ symmetric positive
definite matrices $C_1$ and $C_2$, such that

$$BAB' = I_p, \quad C_1C_1 = A, \quad C_2C_2 = A^{-1}.$$

This can be proved by means of the decomposition $A = VDV'$ in (A.4), where $V$ is
an orthogonal matrix with $VV' = V'V = I_p$ and $D$ is a diagonal matrix with
elements $\lambda_1, \ldots, \lambda_p$. Let $v_i$ be the $i$th column of $V$; then, because $A$ is positive
definite, it follows that $v_i'Av_i = \lambda_i > 0$. Let $D^{1/2}$ be the diagonal matrix with
elements $\sqrt{\lambda_i}$ on the diagonal and let $D^{-1/2}$ be the diagonal matrix with elements
$1/\sqrt{\lambda_i}$ on the diagonal. Then $B = D^{-1/2}V'$, $C_1 = VD^{1/2}V'$, and $C_2 = VD^{-1/2}V'$
have the properties mentioned above. The matrix $C_1$ is called a square root of the
matrix $A$ and $C_2$ is a square root of $A^{-1}$.

Example A.8: Simulated Data on Student Learning (continued)

The $5 \times 3$ matrix $X$ has rank three, and the foregoing results imply that $X'X$ and
$(X'X)^{-1}$ are positive definite. We check this for the matrix $X'X$ in (A.1) and leave
the other one as an exercise (use the numerical values for $(X'X)^{-1}$ obtained in
(A.3) in Example A.5). Let $b$ be a $3 \times 1$ vector with elements $b_1$, $b_2$, $b_3$; then

$$
\begin{aligned}
b'X'Xb &= 5b_1^2 + 201b_2^2 + 175b_3^2 + 62b_1b_2 + 58b_1b_3 + 372b_2b_3 \\
&= 5(b_1 + 6.2b_2 + 5.8b_3)^2 + (201 - 5(6.2)^2)b_2^2 + (175 - 5(5.8)^2)b_3^2 \\
&\quad + (372 - 10(6.2)(5.8))b_2b_3 \\
&= 5(b_1 + 6.2b_2 + 5.8b_3)^2 + 8.8b_2^2 + 6.8b_3^2 + 12.4b_2b_3 \\
&= 5(b_1 + 6.2b_2 + 5.8b_3)^2 + 8.8\left(b_2 + \frac{6.2}{8.8}b_3\right)^2 + \left(6.8 - 8.8\left(\frac{6.2}{8.8}\right)^2\right)b_3^2 \\
&= 5(b_1 + 6.2b_2 + 5.8b_3)^2 + 8.8\left(b_2 + \frac{31}{44}b_3\right)^2 + \frac{107}{44}b_3^2.
\end{aligned}
$$

As this is a sum of three squared terms with positive weights it follows that
$b'X'Xb \geq 0$, and $b'X'Xb = 0$ if and only if all three terms are zero. The last term
in the sum shows that then $b_3 = 0$, the middle term then implies that $b_2 = 0$, and
subsequently the first term implies that also $b_1 = 0$. Stated otherwise, for $b \neq 0$ we
have $b'X'Xb > 0$ and this shows that $X'X$ is positive definite.
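
The rewriting of $b'X'Xb$ as a sum of three weighted squares can be checked exactly
with rational arithmetic. The sketch below uses Python's standard `fractions`
module and writes $6.2 = 31/5$, $5.8 = 29/5$, and $8.8 = 44/5$; the function names and
the trial vectors are assumptions chosen for illustration.

```python
from fractions import Fraction as F

def quadratic_form(b1, b2, b3):
    """b'X'Xb written out with the elements of X'X in (A.1)."""
    return (5 * b1**2 + 201 * b2**2 + 175 * b3**2
            + 62 * b1 * b2 + 58 * b1 * b3 + 372 * b2 * b3)

def weighted_squares(b1, b2, b3):
    """The same quantity as a sum of three squares with positive weights."""
    t1 = b1 + F(31, 5) * b2 + F(29, 5) * b3
    t2 = b2 + F(31, 44) * b3
    return 5 * t1**2 + F(44, 5) * t2**2 + F(107, 44) * b3**2

# Hypothetical non-zero trial vectors (b_1, b_2, b_3).
trials = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (6, -1, 0), (-2, 5, -4), (29, 0, -5)]
checks = [(quadratic_form(*b) == weighted_squares(*b), quadratic_form(*b)) for b in trials]
```

Each entry of `checks` pairs the outcome of the equality test with the value of the
quadratic form, which is positive for every non-zero trial vector.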

Projection  matrix 

A square $p \times p$ matrix $A$ is called idempotent if $AA = A$. A symmetric idempotent
matrix is called a projection matrix. A projection matrix is positive semidefinite
because $A = AA = A'A$. If $A$ is a $p \times q$ matrix with $\operatorname{rank}(A) = q$, so that $A'A$ is an
invertible $q \times q$ matrix, then $P = A(A'A)^{-1}A'$ is a $p \times p$ projection matrix with
$\operatorname{rank}(P) = q$. Because $PA = A$, every column of $A$ remains unchanged
when multiplied by $P$, so that $P$ has $q$ eigenvalues equal to one. The other
$(p - q)$ eigenvalues of $P$ are zero. A $p \times p$ projection matrix $A$ with $\operatorname{rank}(A) = r$
can be written as $A = K'K$ for an $r \times p$ matrix $K$ with $KK' = I_r$. To construct $K$, let
$A = VDV'$ be the decomposition (A.4), where the diagonal matrix $D$ contains the
eigenvalues of $A$, that is, the diagonal contains $r$ ones
followed by $(p - r)$ zeros. Let $V_r$ be the $p \times r$ submatrix consisting of the
first $r$ columns of $V$; then $K = V_r'$ has the stated properties.

Example A.9: Simulated Data on Student Learning (continued)

The $5 \times 3$ matrix $X$ that we considered before has rank 3 (see Example A.6). If we
use the result for the inverse $(X'X)^{-1}$ in (A.3) in Example A.5, the matrix
$P = X(X'X)^{-1}X'$ is equal to

$$
P = \frac{1}{107}
\begin{pmatrix}
83 & 34 & 12 & 4 & -26 \\
34 & 41 & -17 & 30 & 19 \\
12 & -17 & 101 & -2 & 13 \\
4 & 30 & -2 & 35 & 40 \\
-26 & 19 & 13 & 40 & 61
\end{pmatrix}.
$$

Clearly, $P$ is symmetric and one can check by direct computation that $PP = P$, so
that $P$ is also idempotent. So $P$ is a projection matrix. According to earlier results
$\operatorname{tr}(P) = \sum_{j=1}^{5} \lambda_j = 3$, as $P$ has three eigenvalues equal to one and two eigenvalues
equal to zero. It is easily checked that the trace of the above matrix $P$ is indeed
equal to three. By applying a matrix software package to determine the eigenvectors
of $P$, we can compute the following (rounded) $3 \times 5$ matrix $K$ (here we do not
show the calculation of the required eigenvalues and eigenvectors to compute
$K = V_r'$):

$$
K = \begin{pmatrix}
-0.08169 & -0.53396 & 0.42535 & -0.52027 & -0.50657 \\
-0.84091 & -0.29451 & 0.08260 & 0.07297 & 0.44045 \\
0.24880 & 0.10645 & 0.86959 & 0.22606 & 0.34566
\end{pmatrix}.
$$

It is a matter of direct computation to check that (up to rounding errors) $K'K = P$
and $KK' = I_3$.
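
The properties of the projection matrix $P = X(X'X)^{-1}X'$ can be confirmed without
rounding errors by using rational arithmetic. The following sketch relies on
Python's standard `fractions` module and takes the inverse $(X'X)^{-1}$ from (A.3); the
function and variable names are assumptions for illustration.

```python
from fractions import Fraction

X = [[1, 4, 4], [1, 6, 5], [1, 6, 7], [1, 7, 6], [1, 8, 7]]   # 5 x 3 data matrix

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# Inverse of X'X as given in (A.3): the cofactor entries divided by 107.
XtX_inv = [[Fraction(c, 107) for c in row]
           for row in [[579, -31, -63], [-31, 34, -31], [-63, -31, 44]]]

P = matmul(matmul(X, XtX_inv), transpose(X))    # 5 x 5 matrix X (X'X)^{-1} X'

is_symmetric = P == transpose(P)
is_idempotent = matmul(P, P) == P
trace_P = sum(P[i][i] for i in range(5))
leaves_X_unchanged = matmul(P, X) == X          # PX = X
```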


A.7  Optimization  of  a  function  of  several  variables 

Used  in  Chapters  1-7. 


Notation 

An econometric model often contains a number of unknown parameters that are
estimated by optimizing a numerical criterion. Examples are least squares (discussed
in Chapters 2 and 3) and maximum likelihood (discussed in Chapter 4). We
denote the unknown parameters by the $p \times 1$ vector $b$ and the criterion function
by $f(b)$, where $f(b)$ is a real number that depends on the value of $b$. For instance, in
our example of student scores the least squares criterion (A.2) is a function that
takes on non-negative values that depend on the chosen values of the three
parameters $b_1$, $b_2$, $b_3$.

Continuous and differentiable functions

First we summarize some concepts and results for functions of a single variable,
that is, with $p = 1$. A function $f(b)$ of a single variable $b$ is called continuous if, for
every value of $b_0$, the function values $f(b)$ are close to $f(b_0)$ if $b$ is close to $b_0$. This
is written as $\lim_{b \to b_0} f(b) = f(b_0)$, and the formal definition is that for every
number $\delta_1 > 0$ there exists a number $\delta_2 > 0$ with the property that
$|f(b) - f(b_0)| < \delta_1$ for all $|b - b_0| < \delta_2$. The function $f$ is called differentiable if,
for every value of $b_0$, there exists a value (say $a_0$), such that

$$\lim_{h \to 0} \frac{f(b_0 + h) - f(b_0) - a_0 h}{h} = 0.$$

We will write this as $f(b_0 + h) - f(b_0) - a_0 h \approx 0$ if $h \approx 0$, or also as
$f(b_0 + h) \approx f(b_0) + h a_0$ if $h \approx 0$. The value $a_0$ is called the derivative of $f$ at $b_0$
and is written as $\frac{df}{db}(b_0)$, and the function $\frac{df}{db}(b)$, seen as a function of $b$, is called the
derivative of $f$. Writing $b = b_0 + h$, we obtain from the foregoing that

$$f(b) \approx f(b_0) + \frac{df}{db}(b_0) \cdot (b - b_0) \quad \text{if } b \approx b_0.$$

For fixed value of $b_0$ the right-hand side of the above expression is a linear function
of $b$, which is called the linear approximation of the function $f$ at $b_0$.
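
The linear approximation is easily illustrated numerically. The sketch below takes
the hypothetical function $f(b) = b^3$ and approximates its derivative by a central
difference quotient; both the function and the step size `h` are assumptions made
for illustration, and the difference quotient is a numerical stand-in for the limit
in the definition above.

```python
def f(b):
    """Hypothetical example function f(b) = b^3, with derivative 3 b^2."""
    return b**3

def derivative(func, b0, h=1e-6):
    """Central difference approximation of the derivative of func at b0."""
    return (func(b0 + h) - func(b0 - h)) / (2 * h)

def linear_approximation(func, b0, b):
    """f(b0) + f'(b0) * (b - b0), the linear approximation of func at b0."""
    return func(b0) + derivative(func, b0) * (b - b0)

slope = derivative(f, 2.0)                     # close to 3 * 2^2 = 12
approx = linear_approximation(f, 2.0, 2.01)    # close to 8 + 12 * 0.01 = 8.12
exact = f(2.01)                                # 8.120601
```

The gap between `approx` and `exact` is of the order of $(b - b_0)^2$, which is why the
approximation is only useful for $b$ close to $b_0$.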


Maxima  and  minima 

The function $f$ is said to have a global maximum at $b_0$ if $f(b_0) \geq f(b)$ for all values
of $b$. The function $f$ has a local maximum at $b_0$ if $f(b_0) \geq f(b)$ for all $b$ close to $b_0$,
formally, if there exists a $\delta > 0$ such that $f(b_0) \geq f(b)$ for all $|b - b_0| < \delta$. The
function $f$ has a global (local) minimum at $b_0$ if $f(b_0) \leq f(b)$ for all $b$ (respectively
for all $b$ close to $b_0$). If $f$ is differentiable, then

$$\frac{df}{db}(b_0) = 0$$

for all values of $b_0$ where $f$ has a (global or local) maximum or minimum. This is
called the first order condition for a maximum or minimum of the function $f$. To
distinguish between maxima and minima we consider the second derivative of
$f$, that is, the derivative of $\frac{df}{db}$, where we assume that this is a differentiable
function. The second derivative is denoted by $\frac{d^2 f}{db^2}$. The function $f$ has a local
maximum at $b_0$ if $\frac{df}{db}(b_0) = 0$ and $\frac{d^2 f}{db^2}(b_0) < 0$, and it has a local minimum if
$\frac{df}{db}(b_0) = 0$ and $\frac{d^2 f}{db^2}(b_0) > 0$.
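
As an illustration of the first and second order conditions, consider the
hypothetical function $f(b) = b^3 - 3b$, with derivative $3b^2 - 3$ and second
derivative $6b$. The sketch below (function names and the example function are
assumptions for illustration) classifies a given point according to the rules
stated above.

```python
def f(b):
    """Hypothetical example function f(b) = b^3 - 3b."""
    return b**3 - 3 * b

def df(b):
    """First derivative of f."""
    return 3 * b**2 - 3

def d2f(b):
    """Second derivative of f."""
    return 6 * b

def classify(b0, tol=1e-12):
    """Apply the first order condition, then the sign of the second derivative."""
    if abs(df(b0)) > tol:
        return "not a stationary point"
    if d2f(b0) < 0:
        return "local maximum"
    if d2f(b0) > 0:
        return "local minimum"
    return "undetermined by the second derivative"

at_minus_one = classify(-1.0)   # f'(-1) = 0 and f''(-1) = -6
at_plus_one = classify(1.0)     # f'(1) = 0 and f''(1) = 6
at_zero = classify(0.0)         # f'(0) = -3, so the first order condition fails
```

Both stationary points are only local: this function is unbounded in both
directions, so it has no global maximum or minimum.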


Continuity  of  a  function  of  several  variables 

Now we extend the concepts above to functions $f(b)$ of several variables, that is,
where $b$ is a $p \times 1$ vector with $p > 1$. The function $f(b)$ is called continuous if, for
every given $p \times 1$ vector $b_0$, the function values $f(b)$ are close to $f(b_0)$ if $b$ is close to
$b_0$. Formally, the function is continuous at $b_0$ if for every $\delta_1 > 0$ there exists a
$\delta_2 > 0$ such that $|f(b) - f(b_0)| < \delta_1$ for all $\|b - b_0\| < \delta_2$. Here $\|b - b_0\|$ denotes
the distance between the two $p \times 1$ vectors $b$ and $b_0$, defined by

$$\|b - b_0\| = \left( \sum_{i=1}^{p} (b_i - b_{0i})^2 \right)^{1/2},$$

where $b_i$ and $b_{0i}$ denote the components of $b$ and $b_0$ respectively.


Differentiability  of  a  function  of  several  variables 

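The function $f(b)$ is said to be partially differentiable with respect to the $i$th
component $b_i$ if, for every given $p \times 1$ vector $b_0 = (b_{01}, \ldots, b_{0p})'$, the function
$g(b_i) = f(b_{01}, \ldots, b_{0,i-1}, b_i, b_{0,i+1}, \ldots, b_{0p})$ is a differentiable function of the
(single) variable $b_i$. The $i$th partial derivative of $f$ at $b_0$ is defined as $\frac{dg}{db_i}(b_{0i})$ and
is denoted by $\frac{\partial f}{\partial b_i}(b_0)$. If the function $f$ is partially differentiable with respect to all
its $p$ components, then the gradient of $f$ at $b_0$ is defined as the $p \times 1$ vector with the
$p$ partial derivatives as elements. The gradient is denoted by $\frac{\partial f}{\partial b}(b_0)$, so that

$$\frac{\partial f}{\partial b}(b_0) = \begin{pmatrix} \frac{\partial f}{\partial b_1}(b_0) \\ \vdots \\ \frac{\partial f}{\partial b_p}(b_0) \end{pmatrix}.$$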


From the results for functions of a single variable it follows that
$f(b) \approx f(b_0) + \frac{\partial f}{\partial b_i}(b_0) \cdot (b_i - b_{0i})$ if $b_i \approx b_{0i}$ and $b_j = b_{0j}$ for all $j \ne i$, that is, if
only the $i$th component of $b$ varies. The function $f$ is called differentiable if for
every $p \times 1$ vector $b_0$ there exists a $p \times 1$ vector $a_0$ such that

$$\lim_{h \to 0} \frac{f(b_0 + h) - f(b_0) - a_0' h}{\|h\|} = 0.$$

Here $h$ denotes a $p \times 1$ vector and $h \to 0$ means that $\|h\| = \left( \sum_{i=1}^{p} h_i^2 \right)^{1/2} \to 0$. In
particular, if we take $h = (0, \ldots, 0, h_i, 0, \ldots, 0)'$ with a single element $h_i$ on the
$i$th position and zeros elsewhere, then $f(b_0 + h)$ is a function only of the $i$th
component of $h$, so that $a_{0i} = \frac{\partial f}{\partial b_i}(b_0)$ and $a_0$ is the gradient of $f$ at $b_0$. Let
$b = b_0 + h$; then we can write the above result as

$$f(b) \approx f(b_0) + \left( \frac{\partial f}{\partial b}(b_0) \right)' (b - b_0) \quad \text{if } b \approx b_0.$$

This is called the linear approximation of the function $f$ at $b_0$.
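As a numerical illustration of this definition, the following minimal sketch checks that the error of the linear approximation becomes small relative to $\|h\|$ as $h \to 0$. The test function, the point $b_0$, and the direction of $h$ are hypothetical choices made only for this illustration; they do not come from the text.

```python
import numpy as np

def f(b):
    # Hypothetical smooth test function of p = 2 variables (not from the text).
    return b[0] ** 2 + 3.0 * b[0] * b[1] + np.sin(b[1])

def gradient(b):
    # Analytic gradient of the test function above.
    return np.array([2.0 * b[0] + 3.0 * b[1], 3.0 * b[0] + np.cos(b[1])])

b0 = np.array([1.0, 0.5])          # arbitrary expansion point
direction = np.array([0.6, -0.8])  # direction of h with unit length

ratios = []
for scale in (1e-1, 1e-2, 1e-3):
    h = scale * direction
    # Error of the linear approximation f(b0) + gradient' h.
    error = f(b0 + h) - f(b0) - gradient(b0) @ h
    ratios.append(abs(error) / np.linalg.norm(h))
    print(f"||h|| = {scale:.0e}, |error| / ||h|| = {ratios[-1]:.2e}")
```

The printed ratio shrinks roughly in proportion to $\|h\|$, which is what differentiability requires of the remainder term.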


Example A.10: Simulated Data on Student Learning (continued)

We consider the sum of squares $f(b) = (y - Xb)'(y - Xb)$ of $p = 3$ variables
defined in (A.2) in Example A.3, that is,

$$f(b) = 38.66 - 27.2b_1 - 176b_2 - 164b_3 + 5b_1^2 + 201b_2^2 + 175b_3^2 + 62b_1b_2 + 58b_1b_3 + 372b_2b_3.$$
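The gradient of $f$ at $b$ is equal to

$$\frac{\partial f}{\partial b} = \begin{pmatrix} \frac{\partial f}{\partial b_1} \\ \frac{\partial f}{\partial b_2} \\ \frac{\partial f}{\partial b_3} \end{pmatrix} = \begin{pmatrix} -27.2 + 10b_1 + 62b_2 + 58b_3 \\ -176 + 402b_2 + 62b_1 + 372b_3 \\ -164 + 350b_3 + 58b_1 + 372b_2 \end{pmatrix} = \begin{pmatrix} -27.2 \\ -176 \\ -164 \end{pmatrix} + \begin{pmatrix} 10 & 62 & 58 \\ 62 & 402 & 372 \\ 58 & 372 & 350 \end{pmatrix} \begin{pmatrix} b_1 \\ b_2 \\ b_3 \end{pmatrix}. \tag{A.5}$$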


If we compare this with (A.1) in Example A.3, then we see that the gradient of the
least squares criterion $(y - Xb)'(y - Xb) = y'y - 2b'X'y + b'X'Xb$ can be written
as

$$\frac{\partial (y - Xb)'(y - Xb)}{\partial b} = -2X'y + 2X'Xb.$$
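The gradient formula can be checked numerically. The sketch below uses the values of $X'X$, $X'y$, and $y'y$ implied by the coefficients of $f(b)$ in this example and compares the analytic gradient with central finite differences. The evaluation point and the step size are arbitrary choices made for this illustration.

```python
import numpy as np

# Moment matrices implied by the coefficients of f(b) in this example.
XtX = np.array([[5.0, 31.0, 29.0],
                [31.0, 201.0, 186.0],
                [29.0, 186.0, 175.0]])
Xty = np.array([13.6, 88.0, 82.0])
yty = 38.66

def f(b):
    # Sum of squares y'y - 2 b'X'y + b'X'X b.
    return yty - 2.0 * b @ Xty + b @ XtX @ b

def gradient(b):
    # Analytic gradient -2 X'y + 2 X'X b.
    return -2.0 * Xty + 2.0 * XtX @ b

def numerical_gradient(func, b, step=1e-6):
    # Central differences, one component of b at a time.
    grad = np.zeros_like(b)
    for i in range(b.size):
        e = np.zeros_like(b)
        e[i] = step
        grad[i] = (func(b + e) - func(b - e)) / (2.0 * step)
    return grad

b = np.array([0.5, -0.2, 0.3])  # arbitrary evaluation point (assumption)
print(gradient(b))
print(numerical_gradient(f, b))
```

Because $f$ is quadratic, the central differences agree with the analytic gradient up to rounding error.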


Maxima  and  minima  of  functions  of  several  variables 

The function $f$ is said to have a global (local) maximum at $b_0$ if $f(b_0) \ge f(b)$ for all
$b$ (respectively for all $b$ close to $b_0$), and a global (local) minimum if $f(b_0) \le f(b)$
for all $b$ (respectively for all $b$ close to $b_0$). If the function $f$ is differentiable, then it
satisfies the following $p$ first order conditions in all points $b_0$ where $f$ has a local
maximum or minimum:

$$\frac{\partial f}{\partial b}(b_0) = 0.$$


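This corresponds to $p$ equations in the $p$ variables $b_0$. To distinguish between
maxima and minima we assume that the function $f$ is twice differentiable, that
is, that each of the $p$ functions $g_i(b) = \frac{\partial f}{\partial b_i}(b)$ is a differentiable function of $b$. Then
the $p^2$ second order partial derivatives of $f$ are defined by $\frac{\partial^2 f}{\partial b_i \partial b_j} = \frac{\partial g_i}{\partial b_j}$ for
$i = 1, \ldots, p$, $j = 1, \ldots, p$. The Hessian matrix of $f$ at $b_0$ is the $p \times p$ matrix of
second order derivatives defined by

$$\frac{\partial^2 f}{\partial b \, \partial b'}(b_0) = \begin{pmatrix} \frac{\partial^2 f}{\partial b_1 \partial b_1}(b_0) & \frac{\partial^2 f}{\partial b_1 \partial b_2}(b_0) & \cdots & \frac{\partial^2 f}{\partial b_1 \partial b_p}(b_0) \\ \frac{\partial^2 f}{\partial b_2 \partial b_1}(b_0) & \frac{\partial^2 f}{\partial b_2 \partial b_2}(b_0) & \cdots & \frac{\partial^2 f}{\partial b_2 \partial b_p}(b_0) \\ \vdots & \vdots & & \vdots \\ \frac{\partial^2 f}{\partial b_p \partial b_1}(b_0) & \frac{\partial^2 f}{\partial b_p \partial b_2}(b_0) & \cdots & \frac{\partial^2 f}{\partial b_p \partial b_p}(b_0) \end{pmatrix}.$$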


The function $f$ has a local maximum at $b_0$ if it satisfies the first order conditions at
$b_0$ and the Hessian matrix at $b_0$ is negative definite. This maximum is global if the
Hessian matrix is negative definite for all $p \times 1$ vectors $b$. The function has a local
minimum at $b_0$ if it satisfies the first order conditions at $b_0$ and the Hessian matrix
at $b_0$ is positive definite, and the minimum is global if the Hessian matrix is
positive definite for all values of $b$.

Data file: XMA01SIM


Example A.11: Simulated Data on Student Learning (continued)

We continue our analysis in Example A.10 of the sum of squares function
$f(b) = (y - Xb)'(y - Xb)$ in (A.2). To determine the minimum of this function
we first solve the first order conditions $\frac{\partial f}{\partial b} = 0$. According to the foregoing analysis
in Example A.10, the gradient can be written as $-2X'y + 2X'Xb$, so the first
order conditions correspond to the three linear equations $X'Xb = X'y$. These are
called the normal equations. We saw before that $X'X$ is an invertible matrix and
we computed the inverse matrix in Example A.5 (see (A.3)). The vector $X'y$ was
computed in Example A.3 (see (A.1)). If we use these results, it follows that the
first order conditions are satisfied for the unique vector $b$ given by


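$$b = (X'X)^{-1}X'y = \frac{1}{107} \begin{pmatrix} 579 & -31 & -63 \\ -31 & 34 & -31 \\ -63 & -31 & 44 \end{pmatrix} \begin{pmatrix} 13.6 \\ 88.0 \\ 82.0 \end{pmatrix} = \frac{1}{107} \begin{pmatrix} -19.6 \\ 28.4 \\ 23.2 \end{pmatrix}.$$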


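To prove that these values of $b$ provide a minimum of $f(b)$ we have to prove that
the Hessian matrix at this point is positive definite. From the gradient in (A.5) in
Example A.10 we obtain

$$\frac{\partial^2 f}{\partial b \, \partial b'} = \begin{pmatrix} \frac{\partial^2 f}{\partial b_1 \partial b_1} & \frac{\partial^2 f}{\partial b_1 \partial b_2} & \frac{\partial^2 f}{\partial b_1 \partial b_3} \\ \frac{\partial^2 f}{\partial b_2 \partial b_1} & \frac{\partial^2 f}{\partial b_2 \partial b_2} & \frac{\partial^2 f}{\partial b_2 \partial b_3} \\ \frac{\partial^2 f}{\partial b_3 \partial b_1} & \frac{\partial^2 f}{\partial b_3 \partial b_2} & \frac{\partial^2 f}{\partial b_3 \partial b_3} \end{pmatrix} = \begin{pmatrix} 10 & 62 & 58 \\ 62 & 402 & 372 \\ 58 & 372 & 350 \end{pmatrix}.$$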

So the Hessian matrix is equal to $2X'X$ with $X'X$ given in (A.1). We have already
checked in Example A.8 that $X'X$ is positive definite. As the Hessian matrix does
not depend on the value of the $3 \times 1$ vector $b$, it follows that the computed value of
$b = (-\frac{98}{535}, \frac{142}{535}, \frac{116}{535})'$ is a global minimum of the least squares criterion $f(b)$. The
vector $b$ is called the least squares estimate of the parameters $b_1, b_2, b_3$ for the score
data of the five students, and the corresponding model is given by
$y = -\frac{98}{535} + \frac{142}{535}x + \frac{116}{535}z \approx -0.183 + 0.265x + 0.217z$, where $y$ stands for FGPA, $x$ for SATM,
and $z$ for SATV. The corresponding minimum sum of squares is obtained from
(A.2) in Example A.3, with (rounded) value $f(-\frac{98}{535}, \frac{142}{535}, \frac{116}{535}) \approx 0.0148$.

For comparison, Exhibit A.3 contains the output of an econometric software
package.

Dependent Variable: FGPA
Method: Least Squares
Included observations: 5

Variable                  Coefficient
CONSTANT                  -0.183178
SATM                       0.265421
SATV                       0.216822

Sum Squared Residuals      0.014766

Exhibit A.3 Simulated Data on Student Learning (Example A.11)

Output of an econometric software package for the least squares coefficients and the
corresponding sum of squares for the data of Exhibit A.1 on the scores FGPA, SATM, and SATV of
five students.
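The computations of this example can also be reproduced with general-purpose numerical software. The following is a minimal sketch in Python with NumPy (a different tool from the econometric package used for Exhibit A.3), starting from the values of $X'X$, $X'y$, and $y'y$ quoted in the examples above. It solves the normal equations, checks that the Hessian $2X'X$ is positive definite through its eigenvalues, and evaluates the minimum sum of squares.

```python
import numpy as np

# Values of X'X, X'y and y'y as used in Examples A.10 and A.11.
XtX = np.array([[5.0, 31.0, 29.0],
                [31.0, 201.0, 186.0],
                [29.0, 186.0, 175.0]])
Xty = np.array([13.6, 88.0, 82.0])
yty = 38.66

def f(b):
    # Sum of squares y'y - 2 b'X'y + b'X'X b.
    return yty - 2.0 * b @ Xty + b @ XtX @ b

# Solve the normal equations X'X b = X'y.
b_hat = np.linalg.solve(XtX, Xty)

# The Hessian 2 X'X is positive definite if all its eigenvalues are positive.
eigenvalues = np.linalg.eigvalsh(2.0 * XtX)

ssr = f(b_hat)
print("least squares estimate:", b_hat)
print("eigenvalues of Hessian:", eigenvalues)
print("minimum sum of squares:", ssr)
```

Solving the linear system directly, rather than forming the inverse of $X'X$ first, is a common numerical choice; for this small example both routes give the same coefficients up to rounding.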


A.8 Concentration and the Lagrange method

Used  in  Chapter  4. 


Method  of  concentration  in  optimization 

To determine the maximum or minimum of a function of several variables it is
sometimes helpful to use the so-called concentration method. Let $f(b)$ be a
function of the $p \times 1$ vector $b$ and let this vector be split in two parts, a $p_1 \times 1$
vector $b_1$ and a $(p - p_1) \times 1$ vector $b_2$, so that $f(b)$ can be written as $f(b_1, b_2)$. For
given values of $b_1$ the function $f(b_1, b_2)$ can be viewed as a function of $b_2$. Let
$m(b_1) = \max_{b_2} f(b_1, b_2)$ be the maximum (with respect to $b_2$) of $f(b_1, b_2)$ for given
values of $b_1$; then the maximum value $m(b_1)$ is a function of $b_1$ that can be
maximized (with respect to $b_1$). There holds

$$\max_{b_1, b_2} f(b_1, b_2) = \max_{b_1} \left\{ \max_{b_2} f(b_1, b_2) \right\}.$$

A similar result holds true for the minimum of a function. The advantage of the
concentration method is that the two minimizations involve fewer variables
($(p - p_1)$ and $p_1$ respectively) than the one-shot minimization of $f$ with respect
to all its $p$ components.
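The concentration method can be illustrated with the sum of squares of the student learning examples. In the sketch below the first block consists of the constant term and the second block of the two slope coefficients. For a given value of the first block the inner minimization has a closed-form solution, and the concentrated function is then minimized over a single variable. The golden-section search and the search interval $[-5, 5]$ are assumptions made for this illustration, not part of the text.

```python
import numpy as np

XtX = np.array([[5.0, 31.0, 29.0],
                [31.0, 201.0, 186.0],
                [29.0, 186.0, 175.0]])
Xty = np.array([13.6, 88.0, 82.0])
yty = 38.66

def f(b):
    # Sum of squares y'y - 2 b'X'y + b'X'X b.
    return yty - 2.0 * b @ Xty + b @ XtX @ b

def inner_minimizer(first):
    # For a fixed first coefficient, the first order conditions for the
    # remaining two coefficients form a 2 x 2 linear system.
    return np.linalg.solve(XtX[1:, 1:], Xty[1:] - XtX[1:, 0] * first)

def concentrated(first):
    # m(first) = minimum of f over the second block for the given first block.
    return f(np.concatenate(([first], inner_minimizer(first))))

def golden_section_min(func, lo, hi, tol=1e-8):
    # Simple golden-section search for a unimodal function on [lo, hi].
    inv_phi = (np.sqrt(5.0) - 1.0) / 2.0
    while hi - lo > tol:
        c = hi - inv_phi * (hi - lo)
        d = lo + inv_phi * (hi - lo)
        if func(c) < func(d):
            hi = d
        else:
            lo = c
    return 0.5 * (lo + hi)

first_hat = golden_section_min(concentrated, -5.0, 5.0)
b_concentrated = np.concatenate(([first_hat], inner_minimizer(first_hat)))
b_direct = np.linalg.solve(XtX, Xty)  # one-shot minimization for comparison

print("two-step (concentrated):", b_concentrated)
print("one-shot:               ", b_direct)
```

Both routes give the same coefficients up to the tolerance of the one-dimensional search.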

Method  of  Lagrange 

As a final topic we consider the maximization or minimization of functions under
restrictions. Let $f$ and $g_j$ ($j = 1, \ldots, r$) be differentiable functions of $p$ variables $b$,
and suppose we wish to determine the maximum or minimum of the function $f(b)$
under the restrictions that $g_j(b) = 0$ for all $j = 1, \ldots, r$. We suppose that the
derivatives of $f$ and $g_j$ ($j = 1, \ldots, r$) are all continuous functions and that the
$p \times r$ matrix with columns $\frac{\partial g_j}{\partial b}$ ($j = 1, \ldots, r$) has rank $r$. The method of Lagrange
states that the constrained maxima and minima of $f(b)$ satisfy the first order
conditions for the Lagrange function $\Lambda(b, \lambda)$ defined by

$$\Lambda(b, \lambda) = f(b) - \sum_{j=1}^{r} \lambda_j g_j(b).$$


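The corresponding set of first order conditions is given by

$$\frac{\partial \Lambda}{\partial b_i} = \frac{\partial f}{\partial b_i} - \sum_{j=1}^{r} \lambda_j \frac{\partial g_j}{\partial b_i} = 0, \quad i = 1, \ldots, p,$$

$$\frac{\partial \Lambda}{\partial \lambda_j} = -g_j(b) = 0, \quad j = 1, \ldots, r.$$

Solutions for this set of $(p + r)$ equations in the $(p + r)$ unknowns $(b, \lambda)$ can be
obtained by numerical methods.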


Interpretation  of  the  Lagrange  multipliers 

Let $(b_0, \lambda_0)$ be a solution of the above set of $(p + r)$ equations; then the Lagrange
multipliers $\lambda_{0j}$ have the following interpretation. Let $h$ be a $p \times 1$ vector with small
entries such that $g_1(b_0 + h) = \alpha_1 \ne 0$ and $g_j(b_0 + h) = 0$ for all $j = 2, \ldots, r$. This
corresponds to relaxing the first restriction. As $h$ is small, it follows that
$\alpha_1 = g_1(b_0 + h) \approx \left( \frac{\partial g_1}{\partial b}(b_0) \right)' h$ and $0 = g_j(b_0 + h) \approx \left( \frac{\partial g_j}{\partial b}(b_0) \right)' h$ for $j = 2, \ldots, r$,
and from the first $p$ first order conditions of the Lagrange function it follows that

$$f(b_0 + h) \approx f(b_0) + \left( \frac{\partial f}{\partial b}(b_0) \right)' h = f(b_0) + \sum_{j=1}^{r} \lambda_{0j} \left( \frac{\partial g_j}{\partial b}(b_0) \right)' h \approx f(b_0) + \lambda_{01} \alpha_1.$$

That is, $\lambda_{01}$ measures the marginal effect on the value of the function $f$ when the
first restriction is relaxed. In a similar way, $\lambda_{0j}$ measures the marginal effect on the
function value due to relaxing the $j$th restriction. For this reason, in business and
economics the Lagrange multipliers $\lambda_0$ are also called the shadow prices of the $r$
restrictions.


Example A.12: Simulated Data on Student Learning (continued)

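We consider again the scores of the five students and we minimize the sum of
squares $f(b) = (y - Xb)'(y - Xb)$ in (A.2) of Example A.3 under the two restrictions
that $b_2 = 0$ and $b_3 = 0$. That is, we impose the model restriction that SATM
and SATV do not affect the FGPA scores. The corresponding Lagrange function is
$\Lambda = f(b) - \lambda_1 b_2 - \lambda_2 b_3$, and, using the expression (A.5) in Example A.10 for the
gradient of $f$, we obtain the following first order conditions:

$$\begin{aligned}
\frac{\partial \Lambda}{\partial b_1} &= \frac{\partial f}{\partial b_1} = -27.2 + 10b_1 + 62b_2 + 58b_3 = 0, \\
\frac{\partial \Lambda}{\partial b_2} &= \frac{\partial f}{\partial b_2} - \lambda_1 = -176 + 62b_1 + 402b_2 + 372b_3 - \lambda_1 = 0, \\
\frac{\partial \Lambda}{\partial b_3} &= \frac{\partial f}{\partial b_3} - \lambda_2 = -164 + 58b_1 + 372b_2 + 350b_3 - \lambda_2 = 0, \\
\frac{\partial \Lambda}{\partial \lambda_1} &= -b_2 = 0, \\
\frac{\partial \Lambda}{\partial \lambda_2} &= -b_3 = 0.
\end{aligned}$$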
Substituting $b_2 = b_3 = 0$ in the first equation gives $b_1 = 2.72$ with corresponding
sum of squares $f(2.72, 0, 0) = 1.668$. The second and third equations give
$\lambda_1 = -7.36$ and $\lambda_2 = -6.24$. For instance, if we relax the first restriction to
$b_2 = \alpha_1 = 0.001$, then $f(2.72, 0.001, 0) = 1.660841$ and the 'increase' in the
sum of squares is $1.660841 - 1.668 = -0.007159 \approx -0.00736 = \lambda_1 \alpha_1$. In a similar
way, relaxing the second restriction to $b_3 = \alpha_2 = 0.001$ gives $f(2.72, 0, 0.001) =
1.661935$, with corresponding 'increase' in the sum of squares $1.661935
- 1.668 = -0.006065 \approx -0.00624 = \lambda_2 \alpha_2$.
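Because $f$ is quadratic and the restrictions are linear, the first order conditions of this example form a linear system in $(b, \lambda)$. The following sketch, under the assumption that the restrictions are written as $Rb = 0$ with a selection matrix $R$ (a notation introduced only for this illustration), solves that system and then repeats the shadow price check.

```python
import numpy as np

XtX = np.array([[5.0, 31.0, 29.0],
                [31.0, 201.0, 186.0],
                [29.0, 186.0, 175.0]])
Xty = np.array([13.6, 88.0, 82.0])
yty = 38.66

def f(b):
    # Sum of squares y'y - 2 b'X'y + b'X'X b.
    return yty - 2.0 * b @ Xty + b @ XtX @ b

# Restrictions b2 = 0 and b3 = 0 written as R b = 0 (notation assumed here).
R = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

# First order conditions: 2 X'X b - R' lambda = 2 X'y and R b = 0.
system = np.block([[2.0 * XtX, -R.T],
                   [R, np.zeros((2, 2))]])
rhs = np.concatenate((2.0 * Xty, np.zeros(2)))
solution = np.linalg.solve(system, rhs)
b0, lam = solution[:3], solution[3:]

# Relax each restriction in turn by a small amount alpha.
alpha = 0.001
change_1 = f(b0 + np.array([0.0, alpha, 0.0])) - f(b0)
change_2 = f(b0 + np.array([0.0, 0.0, alpha])) - f(b0)

print("restricted estimate:", b0)
print("Lagrange multipliers:", lam)
print("change in f:", change_1, "versus lambda_1 * alpha:", lam[0] * alpha)
print("change in f:", change_2, "versus lambda_2 * alpha:", lam[1] * alpha)
```

The small gap between the actual change and $\lambda_j \alpha_j$ is the second order term of $f$, which the linear approximation ignores.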
Exercise


Consider the data on FGPA (denoted by $y$), SATM (denoted by $x$), and SATV
(denoted by $z$) of the five students in Exhibit A.1. In the text we discussed the
model $y = b_1 + b_2 x + b_3 z$. In this exercise we analyse this model without the
constant term, that is,

$$y = c_1 x + c_2 z.$$

a. Write this model in the form $y = Zc$ where $y$ is the vector of FGPA scores and
$Z$ the matrix with SATM and SATV scores and where $c$ is the $2 \times 1$ vector
with the coefficients $c_1$ and $c_2$.

b. Compute the matrix $Z'Z$ and its inverse $(Z'Z)^{-1}$.

c. Check that the matrices $Z'Z$ and $(Z'Z)^{-1}$ are both positive definite.

d. Compute the two eigenvalues of the matrix $Z'Z$ and check that the sum and
product of these two eigenvalues are respectively equal to the trace and the
determinant of this matrix.

e. Write $f(c_1, c_2) = (y - Zc)'(y - Zc)$ as a function of $c_1$ and $c_2$. Derive the
gradient and the Hessian matrix of this function.

f. Use the results in e to compute the minimum of the function $f(c_1, c_2)$ and
prove that this is a global minimum.

g. Check that the global minimum in f is obtained for $c = (Z'Z)^{-1}Z'y$.

h. The model of this exercise corresponds to the original model
$y = b_1 + b_2 x + b_3 z$ under the restriction that $b_1 = 0$. The values of $c_1$ and $c_2$
computed in f minimize the function
$g(b_1, b_2, b_3) = \sum_{i=1}^{5} (y_i - b_1 - b_2 x_i - b_3 z_i)^2$ under the restriction that
$b_1 = 0$. Write down the Lagrange function $\Lambda(b_1, b_2, b_3, \lambda)$ that corresponds to
this restricted minimization problem, and derive the four first order
conditions for a minimum of the function $\Lambda$.

i. Solve the four equations of h and check that the restricted estimates of $b_2$
and $b_3$ are equal to the computed values of $c_1$ and $c_2$ in f. What is the shadow
price of the restriction $b_1 = 0$? Give an interpretation of the shadow price,
using the results of Example A.11.

j. Use a software package to check the computed values of $c_1$ and $c_2$ in f.
This appendix describes the data sources and the definitions of the variables of all
empirical data sets used in the examples and in the empirical exercises.¹ The
numerical  data  can  be  downloaded  from  the  web  site  of  the  book.  All  econometric 
analysis  in  the  examples  and  exercises  in  the  book  is  performed  with  the  software 
package  EViews  (version  3  suffices),  but  other  packages  can  also  be  used  in  most 
cases. 

The  names  of  the  data  files  start  with  XM  for  examples  and  with  XR  for 
exercises,  followed  by  three  digits  indicating  the  chapter  and  the  example  or 
exercise  number,  and  concluded  with  three  letters  to  indicate  the  data  content. 
For  instance,  the  file  XR210COF  contains  the  data  of  Exercise  2.10  on  coffee 
sales.  If  a  data  set  is  used  in  different  chapters,  then  separate  data  files  are  included 
for  each  chapter,  because  in  some  cases  different  variables  are  analysed  in  the 
different  chapters.  Some  of  the  original  data  sources  contain  additional  variables 
that  are  not  mentioned  if  they  are  not  used  in  this  book.  Missing  values  in  the 
original  data  sources  are  not  deleted. 

The  list  on  p.  748  facilitates  the  use  of  this  appendix.  For  instance,  if  you  need 
further  information  on  the  data  in  the  file  XR210COF,  the  list  shows  ‘COF’ 
(Coffee  Sales)  as  the  4th  data  set  described  in  this  appendix.  For  each  data  set, 
this  appendix  gives  information  on 

•  the  topic  of  the  data  set, 

•  the  type  of  data, 

•  the  source  of  the  data, 

•  the  meaning  of  the  variables, 

•  a  list  of  examples  and  exercises  where  the  data  are  used  in  the  book. 

The  data  sets  are  ordered  according  to  their  first  appearance  in  the  book. 

1  In  this  appendix  we  do  not  discuss  simulated  data  sets  (which  have  extension  SIM)  because 
the  main  text  describes  the  simulation  set-up  explicitly  for  each  case.  Some  simulated  data  sets 
can  be  downloaded  from  the  web  site  of  the  book  —  that  is,  the  data  needed  for  Exercises  4.12, 
5.23,  and  6.10,  and  for  the  examples  in  Appendix  A. 
List of Data Sets

(The  chapters  where  the  data  set  is  used  are  in  parentheses.) 

1.  STU:  Student  Learning  (Chapters  1-3,  6). 

2.  BWA:  Bank  Wages  (Chapters  1-6). 

3.  SMR:  Stock  Market  Returns  (Chapters  2,  4,  5,  7). 

4.  COF:  Coffee  Sales  (Chapters  2-5). 

5.  PMI:  Primary  Metal  Industries  (Chapters  3,  7). 

6.  MGC:  Motor  Gasoline  Consumption  (Chapters  3,  5,  7). 

7.  FEX:  Food  Expenditure  (Chapters  4,  5). 

8.  FAS:  Fashion  Sales  (Chapters  5,  7). 

9.  IBR:  Interest  and  Bond  Rates  (Chapters  5,  7). 

10.  INP:  Industrial  Production  (Chapters  5,  7). 

11.  TOP:  Salaries  of  Top  Managers  (Chapter  5). 

12.  USP:  US  Presidential  Election  (Chapter  5). 

13.  DMF:  Direct  Marketing  for  Financial  Product  (Chapter  6). 

14.  DUS:  Duration  of  Strikes  (Chapter  6). 

15.  DJI:  Dow-Jones  Index  (Chapter  7). 

16.  MOM:  Mortality  and  Marriages  (Chapter  7). 

17.  TBR:  Treasury  Bill  Rates  (Chapter  7). 

18.  CAR:  Car  Production  (Chapter  7). 

19.  NEP:  Nuclear  Energy  Production  (Chapter  7). 

20.  GNP:  Gross  National  Product  (Chapter  7). 

21.  EXR:  Exchange  Rates  (Chapter  7). 

22.  STP:  Standard  and  Poor  Index  (Chapter  7). 

23.  MOR:  Market  for  Oranges  (Chapter  7). 
1. Student Learning (STU)

Topic.  Scores  of  students  of  the  Vanderbilt  University  in  the  USA. 

Type  of  data.  Cross  section,  609  observations,  4  variables. 

Source.  J.  S.  Butler,  T.  A.  Finegan,  and  J.  J.  Siegfried,  ‘Does  More  Calculus 
Improve  Student  Learning  in  Intermediate  Micro-  and  Macroeconomic  Theory?’, 
Journal  of  Applied  Econometrics ,  13/2  (1998),  185-202  (data  obtained  from  the 
journal  data  archive  on  the  Internet  site  qed.econ.queensu.ca/jae). 


Variable     Meaning
FGPA         overall grade point average at end of freshman year (on a scale from 0 to 4)
SATM         score on the SAT Mathematics test divided by 100 (on a scale from 0 to 10)
SATV         score on the SAT Verbal test divided by 100 (on a scale from 0 to 10)
FEM          gender (1 for females, 0 for males)

Data file    Used in
XM101STU²    Examples 1.1-1.7, 1.12, 1.13
XR111STU³    Exercises 1.11, 1.12, 2.14
XR314STU     Exercise 3.14
XM608STU⁴    Example 6.8
XR615STU⁵    Exercise 6.15

2  Some  examples  use  the  variable  SATA,  the  average  SAT  score  defined  by  SATA  = 
0.5(SATM  +  SATV). 

3  This is a subset of ten randomly selected students out of the group of 609 students.

4  Whereas  the  previous  data  sets  of  609  students  concerned  the  microeconomics  course,  this 
data  set  contains  additional  data  of  490  students  on  the  macroeconomics  course  (the  data  for 
microeconomics  and  those  for  macroeconomics  are  contained  in  two  separate  files).  The  data  set 
contains  several  additional  variables,  such  as  GRINTERMICRO  and  GRINTERMACRO 
(obtained  grades  in  microeconomics  and  macroeconomics,  on  a  scale  from  0  to  4)  and 
MATHHIGH  (the  level  of  calculus  of  the  student,  0  if  3-4  credit  hours,  1  if  6-12  credit 
hours).  See  Exhibit  6.12  for  a  complete  list  of  the  variables. 

5  This is basically the same data set as the data set for microeconomics (with 609 students) in
XM608STU, with the difference that this data set distinguishes seven attained levels of
mathematics (instead of two) and five majors (instead of three).
2. Bank Wages (BWA)

Topic.  Wages  of  employees  of  a  bank  in  the  USA. 

Type  of  data.  Cross  section,  474  observations,  8  variables. 

Source.  SPSS,  version  10,  2000,  data  file  bank2.sav  (with  thanks  to  SPSS). 


Variable       Meaning
SALARY         current yearly salary (in dollars)
LOGSALARY      (natural) logarithm of SALARY
EDUC           education (number of finished years)
SALBEGIN       yearly salary at employee's first position at same bank (in dollars)
LOGSALBEGIN    (natural) logarithm of SALBEGIN
GENDER         gender variable (0 for females, 1 for males)
MINORITY       minority variable (0 for non-minorities, 1 for minorities)
JOBCAT         job category (1 for administrative jobs, 2 for custodial jobs,
               3 for management jobs)

Data file      Used in
XR113BWA⁶      Exercise 1.13
XM202BWA⁷      Examples 2.2, 2.6, 2.9, 2.11; Section 2.1.4
XM301BWA⁸      Examples 3.1-3.3, 4.1; Sections 3.1.1, 3.1.7, 3.3.2, 3.3.4, 3.4.2, 3.4.4;
               Exercises 3.13, 3.16
XR414BWA       Example 4.1; Exercises 4.14, 4.15
XM501BWA⁹      Examples 5.1, 5.2, 5.4, 5.5, 5.8-5.10, 5.12, 5.15, 5.17;
               Exercises 5.24, 5.25
XM513BWA¹⁰     Examples 5.13, 5.17
XM604BWA¹¹     Examples 6.4, 6.5
XR613BWA¹²     Exercise 6.13
XR614BWA¹³     Exercise 6.14

6  Contains  only  the  data  on  current  salary  and  education. 

7  Contains  only  the  data  on  current  salary  and  education. 

8  For  simplicity  the  logarithm  of  current  salary  is  denoted  by  LOGSAL  (instead  of 
LOGSALARY). 

9  This  data  set  contains  dummy  variables  (DUMJCAT2  and  DUMJCAT3)  to  denote  the  job 
category,  where  the  first  category  is  taken  as  reference  category. 

10  This  file  contains  grouped  data  obtained  by  dividing  the  474  employees  into  twenty-six 
groups.  The  group  mean  of  LOGSALARY  is  denoted  by  MEANLOGSAL  and  that  of  EDUC  by 
MEANEDUC. 

11  This  file  contains  the  data  of  the  258  male  employees.  Example  6.5  uses  the  ordinal  variable 
ORDERJOBCAT  for  the  job  category  (instead  of  the  nominal  variable  JOBCAT),  with  values 
1  for  custodial  jobs,  2  for  administrative  jobs,  and  3  for  management  jobs. 

12  Contains  only  the  data  for  the  258  male  employees  and  contains  the  additional  variable 
PREVEXP,  the  previous  experience  (in  months). 

13  This file is comparable to XM604BWA and contains the data of all 474 employees. As
males  are  the  reference  category  in  this  exercise,  the  variable  GENDER  is  now  redefined  as  0  for 
males  and  1  for  females. 
3. Stock Market Returns (SMR)

Topic.  Monthly  excess  returns  in  the  UK  for  the  sector  of  cyclical  consumer  goods. 

Type of data. Time series, monthly data over the period 1980-99 (240 observations),
2 variables.

Source.  DataStream  (data  obtained  from  this  database  in  2000,  with  thanks  to 
Ronald  van  Dijk). 


Variable     Meaning
RENDCYCO     excess returns on an index of 104 stocks in the sector of cyclical
             consumer goods (household durables, automobiles, textiles,
             and sports) in the UK (in percentages)¹⁴
RENDMARK     excess returns on an overall stock market index of the total
             market in the UK (in percentages)

Data file    Used in
XR201SMR     Examples 2.1, 2.5; Exercises 2.1, 2.2, 2.11, 2.12
XR215SMR¹⁵   Exercise 2.15
XM404SMR     Examples 4.4, 4.5, 4.7; Section 4.4.6
XR417SMR¹⁶   Exercise 4.17
XM527SMR     Examples 5.27, 5.28
XR530SMR¹⁷   Exercise 5.30
XR722SMR     Exercise 7.22

14  The excess returns are defined as follows. Let $p_i$ be the closing price of the index at the last
trading day in month $i$ and let $r_i$ be the one-month interest rate at the start of month $i$. Then the
return $v_i$ of the index over month $i$ is defined by $v_i = (p_i - p_{i-1})/p_{i-1}$ and the excess return is
defined by $v_i - r_i$. The reported numbers in the data file are percentages, that is, $100(v_i - r_i)$.
For $r_i$ we take the so-called middle rate.

15  This file also contains the excess returns of stock indices of three other sectors, namely,
Noncyclical Consumer Goods (RENDNCCO), Information Technology (RENDIT), and
Telecommunication, Media and Technology (RENDTEL).

16  This  file  contains  the  excess  returns  for  the  sector  of  Noncyclical  Consumer  Goods 
(RENDNCCO)  and  the  market  returns  (RENDMARK). 

17  This  file  also  contains  the  excess  returns  of  the  stock  index  for  the  sector  of  Noncyclical 
Consumer  Goods  (RENDNCCO). 
4. Coffee Sales (COF)

Topic.  Effect  of  price  reductions  on  coffee  sales  in  suburban  areas  in  Paris. 

Type  of  data.  Cross  section  data  of  weekly  coffee  sales  for  two  brands  of  coffee  in 
a  controlled  marketing  experiment,  12  observations,  3  variables. 

Source. A. C. Bemmaor and D. Mouchoux, 'Measuring the Short-Term Effect of
In-Store Promotion and Retail Advertising on Brand Sales: A Factorial Experiment',
Journal of Marketing Research, 28 (1991), 202-14 (data for two brands of
regular ground coffee (RGC1 and RGC2) are obtained from table 3 on p. 206).¹⁸


Variable     Meaning
QUANTITY     quantity of coffee sold (weekly total unit sales of the considered shops)
PRICE        price of coffee (indexed, usual price has value 1, price is 0.95
             for 5% price reduction and 0.85 for 15% price reduction)
DEAL         deal rate (with values 1, 1.05, and 1.15, defined by DEAL = 2 - PRICE)

Data file    Used in
XR210COF¹⁹   Examples 2.3, 2.7; Exercise 2.10
XR317COF²⁰   Exercise 3.17
XM402COF²¹   Examples 4.2, 4.6; Section 4.2.5
XR413COF²²   Exercise 4.13
XM507COF²³   Example 5.7
XR526COF²⁴   Exercise 5.26

18  The  experiment  contains  six  weeks  with  no  actions,  six  weeks  with  price  reductions 
without  advertisement,  and  six  weeks  with  price  reductions  combined  with  advertisement.  The 
basic  data  set  considers  the  twelve  weeks  without  advertisement. 

19  This  file  contains  data  for  the  second  brand  of  coffee  (RGC2)  for  the  twelve  weeks  without 
advertisement. 

20  This  file  contains  data  for  the  first  brand  (RGC1)  with  eighteen  observations  (instead  of 
twelve),  including  the  six  observations  for  weeks  with  advertisement.  The  file  also  contains  the 
variable  A  (advertisement  dummy  with  value  1  in  weeks  with  advertisement  and  0  in  weeks 
without  advertisement). 

21  This  file  contains  twenty-four  observations,  twelve  observations  for  each  of  the  two  brands 
of  coffee  (RGC1  and  RGC2)  for  the  twelve  weeks  without  advertisement.  It  contains  as 
additional variables LOGQ1 and LOGQ2 (the logarithms of the sold quantities for the two
brands),  D1  and  D2  (the  deal  rates  for  the  two  brands)  and  LOGD1  and  LOGD2  (the  logarithms 
of  D1  and  D2). 

22  This  file  contains  the  data  for  the  second  brand  of  coffee  (RGC2)  for  the  twelve  weeks 
without  advertisement. 

23  This  file  contains  the  same  twenty-four  observations  as  the  file  XM402COF.  Apart  from  the 
variables  QUANTITY,  PRICE,  and  DEAL,  it  also  contains  the  variables  LOGQ  (the  logarithm  of 
the  sold  quantity)  and  DUMRGC1  and  DUMRGC2  (dummy  variables  that  indicate  the  brand). 

24  This  file  contains  data  for  the  first  brand  (RGC1)  for  the  twelve  action  weeks,  six  weeks 
with  price  reductions  without  advertisement  and  six  weeks  with  price  reductions  combined  with 
advertisement.  It  contains  as  additional  variables  A  (advertisement  dummy,  1  in  weeks  with 
advertisement  and  0  otherwise)  and  DP  (price  dummy,  1  in  weeks  with  15  per  cent  price 
reduction  and  0  in  weeks  with  5  per  cent  price  reduction). 
5. Primary Metal Industries (PMI)

Topic.  Production  in  the  US  Primary  Metal  Industries  (SIC33). 

Type  of  data.  Pooled  annual  data  of  26  firms  over  the  period  1958-94  (37 
observations  per  firm),  3  variables. 

Source.  E.  J.  Bartelsman  and  W.  Gray,  ‘The  NBER  Manufacturing  Productivity 
Database’,  National  Bureau  of  Economic  Research,  NBER  Technical  Working 
Paper  205  (1996),  30  pp.  (data  obtained  from  the  Internet  site  www.nber.org  in 
1998,  with  thanks  to  Piet  Lesuis). 


Variable      Meaning
PRODUCTION    value added (in millions of 1987 dollars)
LABOUR        total payroll of production worker wages (in millions of 1987 dollars)
CAPITAL       capital stock, both structures and equipment (in millions of 1987 dollars)

Data file     Used in
XR315PMI²⁵    Exercise 3.15
XM729PMI²⁶    Examples 7.29, 7.30

25  Contains the three variables in logarithms, denoted by LOGY (for production), LOGL (for
labour), and LOGK (for capital). The data set consists of a cross section of the twenty-six firms
for the single year 1994.

26  Pooled data set; contains the logarithms of the three variables (denoted by LOGPROD,
LOGLAB, and LOGCAP) and the nominal variable ID (indicating the firm, ranging from
1 to 26).
6. Motor Gasoline Consumption (MGC)

Topic.  Consumption  of  motor  gasoline  in  the  USA. 

Type  of  data.  Time  series,  yearly  data  over  the  period  1970-99  (30  observations), 
7  variables. 

Source. Economic Report of the President 2000 (statistical tables obtained from
the Internet site w3.access.gpo.gov/eop in 2000)²⁷ and Census Bureau and
Department of Energy (data obtained from the database Economagic at the Internet
site www.economagic.com in 2000).²⁸


Variable     Meaning
SGAS         nominal retail sales of gasoline service stations (in millions of dollars)
PGAS         motor gasoline retail price (US city average, in cents per gallon)
INC          nominal personal disposable income (in billions of dollars)
PALL         consumer price index (indexed so that the average value over
             the years 1982-84 is equal to 100)
PPUB         consumer price index of public transport (indexed in the same way)
PNCAR        consumer price index of new cars (indexed in the same way)
PUCAR        consumer price index of used cars (indexed in the same way)

Data file    Used in
XR318MGC²⁹   Exercises 3.18, 3.19
XM531MGC³⁰   Examples 5.31, 5.34; Exercise 5.31
XR721MGC³¹   Exercise 7.21

27  The  variable  INC  is  taken  from  file  b29,  the  variable  PALL  from  file  b60,  and  the  variables 
PPUB,  PNCAR,  and  PUCAR  from  file  b59b. 

28  The  variable  SGAS  is  obtained  from  the  Census  Bureau  (retail  sales  by  kind  of  business, 
gasoline  service  stations)  and  the  variable  PGAS  from  the  Department  of  Energy  (energy  prices, 
motor  gasoline  retail  prices,  all  types).  The  original  monthly  data  for  SGAS  are  aggregated  to 
yearly  data  and  the  original  monthly  data  for  PGAS  are  averaged  to  yearly  data.  The  data  for 
PGAS  are  given  in  Economagic  for  the  years  1978-99;  the  values  for  1970-77  are  obtained  from 
W.  H.  Greene,  Econometric  Analysis  (3rd  edn.,  Prentice  Hall,  1997),  p.  328,  after  appropriate 
scaling. 

22  The  exercises  use  several  variables  that  are  defined  directly  in  terms  of  the  variables 
mentioned  above,  by  taking  variables  in  real  terms  (instead  of  nominal)  and  by  taking  loga¬ 
rithms. 

30  This  file  contains  the  following  variables  in  real  terms,  and  taken  in  logarithms: 
GC  =  log  (SGAS/PGAS)  (gasoline  consumption),  PG  =  log  (PGASS/PALL)  (real  price  of 
gasoline),  RI  =  log  (INC/PALL)  (real  income),  RPT  =  log  (PPUB/PALL)  (real  price  of  public 
transport),  RPN  =  log  (PNCAR/PALL)  (real  price  of  new  cars),  and  RPU  =  log  (PUCAR/PALL) 
(real  price  of  used  cars). 

31  This  file  is  the  same  as  XM531MGC. 
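
The definitions of the variables in file XM531MGC can be illustrated with a small sketch. It assumes natural logarithms and one dictionary of raw values per year; the function name `mgc_log_real_variables` and the numbers in `example_year` are hypothetical and are not taken from the data file.

```python
import math

def mgc_log_real_variables(row):
    """Return the six log variables of file XM531MGC for one year of raw data."""
    pall = row["PALL"]  # general consumer price index, used as deflator
    return {
        "GC": math.log(row["SGAS"] / row["PGAS"]),   # gasoline consumption
        "PG": math.log(row["PGAS"] / pall),          # real price of gasoline
        "RI": math.log(row["INC"] / pall),           # real income
        "RPT": math.log(row["PPUB"] / pall),         # real price of public transport
        "RPN": math.log(row["PNCAR"] / pall),        # real price of new cars
        "RPU": math.log(row["PUCAR"] / pall),        # real price of used cars
    }

# Hypothetical values for a single year (not taken from the data file).
example_year = {"SGAS": 100000.0, "PGAS": 125.0, "INC": 4000.0, "PALL": 100.0,
                "PPUB": 110.0, "PNCAR": 95.0, "PUCAR": 105.0}
derived = mgc_log_real_variables(example_year)
print(derived)
```

Dividing nominal sales by the price gives a quantity measure, and dividing each price index by PALL gives a price relative to the general price level, so a negative value of RPN in a given year means that new cars were cheap relative to the base period.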
7. Food Expenditure (FEX)

Topic. Household expenditure on food and beverages.

Type of data. Cross section, 54 observations [32], 5 variables.

Source. J. R. Magnus and M. S. Morgan (eds.), ‘The Experiment in Applied
Econometrics’, Journal of Applied Econometrics, 12/5 (1997), 651-61, special
issue (data obtained from Experiment Information Pack of the editors, file
BS50US, see the journal data archive at the Internet site qed.econ.queensu.ca/jae).

Variable    Meaning
TOTCONS     total consumption expenditure (in 10,000 dollars per year)
FOODCONS    food and beverage consumption expenditure (in 10,000 dollars
            per year)
FRACFOOD    fraction of expenditure spent on food (FOODCONS/TOTCONS) [33]
AHSIZE      household size (average in the group of households)
SAMPSIZE    number of households in the group

Datafile        Used in
XR416FEX        Example 4.3; Exercise 4.16
XM520FEX [34]   Examples 5.20, 5.23, 5.25; Exercise 5.27

[32] The data are group averages of 12,448 households (divided into fifty-four
groups). The household data are obtained by an interview survey in 1951 in the
USA on income and expenditure over the whole year of 1950.

[33] The original data source contains the variables TOTCON and FOODCON measured
in dollars per year; here we use TOTCONS = TOTCON/10,000 and FOODCONS =
FOODCON/10,000.

[34] Contains restricted data set of forty-eight groups obtained by deleting the
six smallest groups (with SAMPSIZE < 20). The data set is ordered in segments
according to the variable AHSIZE, and for groups within the same segment
according to the variable TOTCONS (for details see Example 5.20 (pp. 356-8)). The
data can also be ordered randomly, and the random ordering discussed in the book
corresponds to the variable RAND in the data file.
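
As a rough illustration of how the restricted file relates to the original source, the sketch below rescales the raw variables TOTCON and FOODCON, computes FRACFOOD, and drops groups with SAMPSIZE < 20. The function name `restrict_fex` and the two example groups are hypothetical; the ordering by AHSIZE and TOTCONS described in the text is not reproduced here.

```python
def restrict_fex(groups, min_size=20):
    """Rescale expenditures, add FRACFOOD, and drop groups smaller than min_size."""
    kept = []
    for g in groups:
        if g["SAMPSIZE"] < min_size:
            continue  # small groups are deleted from the restricted data set
        totcons = g["TOTCON"] / 10_000    # dollars -> units of 10,000 dollars
        foodcons = g["FOODCON"] / 10_000
        kept.append({**g,
                     "TOTCONS": totcons,
                     "FOODCONS": foodcons,
                     "FRACFOOD": foodcons / totcons})
    return kept

# Two made-up groups (not taken from the data file).
groups = [
    {"TOTCON": 4000.0, "FOODCON": 1000.0, "AHSIZE": 3.1, "SAMPSIZE": 250},
    {"TOTCON": 2500.0, "FOODCON": 900.0, "AHSIZE": 1.0, "SAMPSIZE": 12},
]
restricted = restrict_fex(groups)
print(restricted)
```

Note that FRACFOOD does not depend on the rescaling, because the factor 10,000 cancels in the ratio.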
8. Fashion Sales (FAS)

Topic. US retail sales of high-priced fashion apparel.

Type of data. Time series, quarterly data over the period 1986-92 (28
observations), 3 variables (and 6 derived variables).

Source. G. M. Allenby, L. Jen, and R. P. Leone, ‘Economic Trends and Being
Trendy: The Influence of Consumer Confidence on Retail Fashion Sales’, Journal
of Business and Economic Statistics, 14/1 (1996), 103-11 (data obtained from the
journal data site ftp://www.amstat.org/JBES_View/). [35]

Variable      Meaning
SALES         real fashion sales (in millions of dollars per thousand square
              feet of retail space)
PURABI        purchasing ability (real personal disposable income divided by
              consumer price index for apparel)
CONFI         consumer confidence (an index of consumer sentiment of the
              University of Michigan Survey Research Center)
LOGSALES      logarithm of SALES
LOGA          logarithm of PURABI
LOGC          logarithm of CONFI
D2, D3, D4    quarterly dummy variables for quarters 2, 3, 4

Datafile        Used in
XM506FAS        Example 5.6
XR725FAS [36]   Exercise 7.25

[35] The original monthly data are aggregated to quarterly data. See p. 105 of the
article for a detailed definition of the variables.

[36] Pooled data with twenty-eight quarterly sales data for each of five speciality
divisions. The divisions 1 and 2 specialize in high-priced fashion apparel
(division 1 corresponds to the data set XM506FAS), division 3 in low-priced
merchandise, and divisions 4 and 5 in specialities like large sizes,
undergarments, and so on.
9. Interest and Bond Rates (IBR)

Topic. Treasury Bill rate and AAA bond rate in the USA.

Type of data. Time series, monthly data over the period 1950-99 (600
observations), 2 variables.

Source. Federal Reserve Board of Governors (data obtained from the database
Economagic at the Internet site www.economagic.com in 2000). [37]

Variable   Meaning
DUS3MT     monthly change in the three-month US Treasury Bill rate
           (in percentages) [38]
DAAA       monthly change in the AAA corporate bond yield (in percentages)

Datafile        Used in
XM511IBR [39]   Examples 5.11, 5.12, 5.14, 5.16, 5.18, 5.19, 5.21, 5.22, 5.24,
                5.30, 5.32, 5.33
XR528IBR [40]   Exercise 5.28
XM722IBR [41]   Examples 7.22, 7.25-7.27, 7.32; Exercise 7.19

[37] The Treasury Bill rate is the three-month rate (auction average) and the AAA
bond rate is Moody’s Seasoned AAA, a monthly average over long-term bonds of
firms with AAA rating.

[38] If the value of the rate in month t is denoted by $r_t$ and that of the
previous month by $r_{t-1}$, then the change $\mathrm{DUS3MT}_t$ in month t is
defined by $r_t - r_{t-1}$.

[39] This file also contains the dummy variable DUM7599 with value 1 for 1975-99
and value 0 for 1950-74, which is used in Examples 5.16 and 5.18.

[40] This file contains the monthly levels (in percentages) of the three-month
Treasury Bill rate, denoted by US3MTBIL, so that
$\mathrm{DUS3MT}_t = \mathrm{US3MTBIL}_t - \mathrm{US3MTBIL}_{t-1}$. The data
period is January 1985 to December 1999 (180 observations).

[41] Apart from the first differences DUS3MT and DAAA, this file also contains the
level US3MTBIL of the three-month Treasury Bill rate and the level AAA of the AAA
bond rate that are used in Examples 7.25 and 7.27 and in Exercise 7.19 (both
levels are in percentages). This file contains monthly data over the period
1948-99, so that the values in 1948 and 1949 can be used as pre-sample values to
estimate models for the period 1950-99.
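
The relation between the levels and the monthly changes can be sketched as follows. The helper `first_differences` and the four rate values are hypothetical and serve only to show that differencing a series of n levels yields n - 1 changes, which is one reason why pre-sample values are convenient.

```python
def first_differences(levels):
    """Return the changes x_t - x_{t-1}; the first observation has no change."""
    return [curr - prev for prev, curr in zip(levels, levels[1:])]

# Hypothetical monthly levels of the Treasury Bill rate, in percentages.
us3mtbil = [7.76, 8.22, 8.57, 8.00]
dus3mt = first_differences(us3mtbil)
print(dus3mt)
```

The changes add up to the difference between the last and the first level, so the level series can be rebuilt from one starting value and the series of changes.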
10. Industrial Production (INP)

Topic. Index of total industrial production in the USA.

Type of data. Time series, quarterly data over the period 1950.1-1998.3 (195
observations) [42], 1 variable.

Source. OECD Main Economic Indicators (data obtained from the database
DataStream in 1999).

Variable   Meaning
IP         total industrial production index (indexed so that the average
           value over the four quarters of 1992 is equal to 100)

Datafile        Used in
XM526INP        Example 5.26; Exercise 5.29
XM701INP [43]   Examples 7.1, 7.7, 7.8, 7.10, 7.11, 7.13, 7.14, 7.16-7.18,
                7.20; Exercise 7.16

[42] In most of the analysis the data prior to 1961.1 are used as starting values
and the data from 1995.1-1998.3 are left out for forecast evaluation purposes, in
which case the effective sample ranges from 1961.1 to 1994.4 with 136
observations.

[43] This file contains the variables X (this is IP, the level of the index), Y
(defined by Y = log(X), the logarithm of the series IP), and D4Y (defined by
$\mathrm{D4Y} = Y - Y(-4) = \log(X/X(-4)) \approx (X - X(-4))/X(-4)$, the yearly
growth rate of the series IP). In addition, in Example 7.16 we use the variable
YEARSUMY (the year total Y + Y(-1) + Y(-2) + Y(-3) over the last four quarters)
and in Example 7.17 we use some dummy variables for specific observations (for
instance, the dummy variable DUM611 has value 1 for the first quarter of 1961 and
value 0 for all other quarters).
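
The derived variables of file XM701INP can be mimicked with a few lines of code. This is only a sketch under the assumption of natural logarithms; the helper names, the quarter labels, and the six index values are hypothetical and are not the IP data.

```python
import math

def log_series(x):
    """Y = log(X), using natural logarithms."""
    return [math.log(v) for v in x]

def lag_difference(y, lag=4):
    """D4Y = Y - Y(-4); the first `lag` observations are lost."""
    return [y[t] - y[t - lag] for t in range(lag, len(y))]

def year_sum(y):
    """YEARSUMY = Y + Y(-1) + Y(-2) + Y(-3); the first three observations are lost."""
    return [sum(y[t - 3:t + 1]) for t in range(3, len(y))]

def observation_dummy(quarters, target):
    """1 for the quarter labelled `target`, 0 for all other quarters."""
    return [1 if q == target else 0 for q in quarters]

# Hypothetical index values for six consecutive quarters.
quarters = ["1960.1", "1960.2", "1960.3", "1960.4", "1961.1", "1961.2"]
x = [100.0, 101.0, 102.0, 103.0, 104.0, 105.0]
y = log_series(x)
d4y = lag_difference(y)
growth = [(x[t] - x[t - 4]) / x[t - 4] for t in range(4, len(x))]
yearsumy = year_sum(y)
dum611 = observation_dummy(quarters, "1961.1")
print(d4y, growth)
```

For growth rates of a few per cent the log difference and the relative change are close (here about 0.039 against 0.04), which is the approximation used in the definition of D4Y.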
11. Salaries of Top Managers (TOP)

Topic. Salaries of top managers and profits of the 100 largest firms in the
Netherlands in 1999. [44]

Type of data. Cross section, 100 observations, 3 variables.

Source. Annual reports of firms (data obtained from the Dutch newspaper
Volkskrant at the Internet site www.volkskrant.nl in 2000).

Variable        Meaning
SALARY          average yearly salary in 1999 of top managers of the same
                firm (in thousands of Dutch guilders)
PROFIT [45]     profit of the firm in 1999 (in millions of Dutch guilders)
TURNOVER [46]   turnover of the firm in 1999 (in millions of Dutch guilders)

Datafile        Used in
XM535TOP [47]   Example 5.35; Exercise 5.32

[44] The set includes firms such as ABN-AMRO, Ahold, ING, Philips, Shell, and
Unilever.

[45] One missing value (in sector of social services), three firms with negative
profits (in telecommunication sector), ninety-six firms with positive profits.

[46] Thirteen missing values (in banking and insurance sector).

[47] The model is formulated in terms of logarithms of the variables (denoted by
LOGSALARY, LOGPROFIT, and LOGTURNOVER). In most cases the analysis is restricted
to the ninety-six firms with positive profits; the total turnover of the four
dropped firms is less than 1 per cent of the total turnover of the 100 firms. In
one case we use the variable TURNOVER; this leaves eighty-four observations.
12. US Presidential Election (USP)

Topic. US presidential election in 2000, results for the state of Florida.

Type of data. Cross section, observations for 67 counties in Florida [48],
10 variables (and 2 derived variables).

Source. CBS (data obtained via the Internet site www.cbsnews.com in 2000, with
thanks to Ruud Koning). [49]

Variable      Meaning
BROWNE        number of votes for candidate Browne
BUCHANAN      idem for Buchanan
BUSH          idem for Bush
GORE          idem for Gore
HAGELIN       idem for Hagelin
HARRIS        idem for Harris
MCREYNOLDS    idem for McReynolds
MOOREHEAD     idem for Moorehead
NADER         idem for Nader
PHILLIPS      idem for Phillips
TOTAL         total number of votes in the county [50]
DUMPALM       dummy variable for the county Palm Beach (1 for county 50
              (Palm Beach), 0 for the other 66 counties)

Datafile    Used in
XR533USP    Exercise 5.33

[48] The reported values are the number of votes for each candidate counted
automatically, before recounting by hand, and excluding votes by mail.

[49] Similar data, of the 10 November recount, are discussed by B. E. Hansen (see
the Internet site www.ssc.wisc.edu/~bhansen/vote for further information). The
data differ somewhat because of the recounts.

[50] Defined as the sum of the number of votes for the ten mentioned candidates.
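
The two derived variables can be reconstructed from the ten vote counts as in the following sketch. The function name `add_derived_variables` and the vote counts are hypothetical; only the variable names and the county number 50 for Palm Beach come from the description above.

```python
CANDIDATES = ["BROWNE", "BUCHANAN", "BUSH", "GORE", "HAGELIN",
              "HARRIS", "MCREYNOLDS", "MOOREHEAD", "NADER", "PHILLIPS"]

def add_derived_variables(votes, county_number, palm_beach_number=50):
    """Add TOTAL (sum over the ten candidates) and the dummy DUMPALM."""
    total = sum(votes[name] for name in CANDIDATES)
    dumpalm = 1 if county_number == palm_beach_number else 0
    return {**votes, "TOTAL": total, "DUMPALM": dumpalm}

# Made-up vote counts for one county (not taken from the data file).
votes = {name: 10 * (i + 1) for i, name in enumerate(CANDIDATES)}
row = add_derived_variables(votes, county_number=50)
print(row["TOTAL"], row["DUMPALM"])
```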
13. Direct Marketing for Financial Product (DMF)

Topic. Response of customers of a commercial bank to a direct marketing campaign
for a new financial product (‘click funds’).

Type of data. Cross section, 925 observations [51], 5 variables.

Source. P. H. Franses, ‘On the Econometrics of Modeling Direct Marketing
Response’, RIBES report 97-15, Rotterdam, 1997 (with thanks to Robeco).

Variable    Meaning
RESPONSE    response (dummy variable, 1 if customer invests in the new
            product and 0 otherwise)
INVEST      amount of money invested by the customer in the new product
            (in hundreds of Dutch guilders)
GENDER      gender (1 for males, 0 for females)
ACTIVITY    activity indicator (1 if customer already invests in other
            products of the bank and 0 otherwise)
AGE         age of customer (in years)

Datafile        Used in
XM601DMF [52]   Examples 6.1-6.3, 6.6, 6.7; Exercises 6.11, 6.12, 6.16

[51] The original database contains more than 100,000 observations.

[52] The file contains the additional variable LOGINV defined by
LOGINV = log(1 + INVEST).
14. Duration of Strikes (DUS)

Topic. Duration of contract strikes in US manufacturing.

Type of data. Cross section, 62 observations, 2 variables.

Source. J. Kennan, ‘The Duration of Contract Strikes in US Manufacturing’,
Journal of Econometrics, 28 (1985), 5-28 (the data are in table 1 on pp. 14-16
of this paper). [53]

Variable     Meaning
STRIKEDUR    strike duration (length of finished strikes, measured in days)
PROD         index of unanticipated industrial production in manufacturing
             (the value 0 corresponds to normal conditions)

Datafile        Used in
XM609DUS [54]   Example 6.9; Exercise 6.17

[53] The data concern official strikes in US manufacturing industries for the
period 1968-76, involving 1000 workers or more, with major issue classified as
general wage changes by the Bureau of Labor Statistics. Attention is restricted
to strikes beginning in June of each year to remove seasonal effects (see Kiefer
(1988), listed in the Further Reading of Chapter 6 (p. 524)).

[54] The file contains the additional variable STRIKECENS80, which is obtained by
censoring the actual durations at a maximum of 80 days, so that
STRIKECENS80 = min(STRIKEDUR, 80).
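
The censoring rule can be written out in a few lines. The function name `censor_at` and the four durations are hypothetical; the sketch only illustrates the definition STRIKECENS80 = min(STRIKEDUR, 80).

```python
def censor_at(durations, limit=80):
    """Censor each duration from above at `limit` days."""
    return [min(d, limit) for d in durations]

# Hypothetical strike durations in days (not taken from the data file).
strikedur = [7, 45, 80, 216]
strikecens80 = censor_at(strikedur)

# From the censored series alone, a value of 80 may be an exact 80-day strike
# or a longer strike that was censored; both cases look the same.
possibly_censored = [c == 80 for c in strikecens80]
print(strikecens80, possibly_censored)
```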
15. Dow-Jones Index (DJI)

Topic. Dow-Jones industrial average.

Type of data. Time series, daily data from 1990 (2 January) to 1999
(31 December), 2528 observations, 1 variable.

Source. Economagic (data obtained from the stock price indices at the Internet site
www.economagic.com in 2000).

Variable   Meaning
DJ [55]    Dow-Jones Industrial Average index (daily close)

Datafile        Used in
XM702DJI [56]   Examples 7.2, 7.15, 7.21

[55] The days are numbered consecutively so that closing days are not counted.

[56] The file contains variables derived from the Dow-Jones index, namely LOGDJ
(the logarithm of DJ) and DLOGDJ (the daily differences of LOGDJ, that is, the
daily returns on the Dow-Jones index). For the purposes of Example 7.21 it also
contains the derived variables DJRET (the daily returns; this is the same series
as DLOGDJ but the two names are useful in different settings) and DJABSRET (the
absolute returns, defined as the absolute value of DJRET). The file further
contains the auxiliary variables DD (day of the month, 1-31), MM (month of the
year, 1-12), and YEAR (year, 1990-1999) to relate the time series to calendar
time.
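
A minimal sketch of the derived return series, assuming natural logarithms and consecutively numbered trading days; the function names and the three closing values are hypothetical and are not taken from the data file.

```python
import math

def log_returns(prices):
    """Daily returns as differences of log prices (DLOGDJ, also called DJRET)."""
    logs = [math.log(p) for p in prices]  # LOGDJ
    return [b - a for a, b in zip(logs, logs[1:])]

def abs_returns(returns):
    """Absolute returns (DJABSRET)."""
    return [abs(r) for r in returns]

# Hypothetical closing values on three consecutive trading days.
dj = [2810.0, 2838.1, 2810.0]
djret = log_returns(dj)
djabsret = abs_returns(djret)
print(djret, djabsret)
```

A rise followed by a fall back to the same level gives two returns of opposite sign that sum to zero, whereas the two absolute returns are equal and positive; this is why absolute returns are informative about volatility rather than direction.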
16. Mortality and Marriages (MOM)

Topic. Mortality in England and Wales and proportion of Church of England
marriages.

Type of data. Time series, yearly data 1866-1911 (46 observations), 2 variables.

Source. G. U. Yule, ‘Why do we Sometimes Get Nonsense-Correlations between
Time-Series?’, Journal of the Royal Statistical Society, 89 (1926), 1-64 (the data
are reconstructed from figure 1 on p. 3 of this paper; see the comments in Example
7.23 (pp. 648-9)).

Variable   Meaning
STMORT     standardized mortality in England and Wales (per 1000 persons)
CEMARR     proportion of Church of England marriages (per 1000 of all
           marriages)

Datafile     Used in
XM723MOM     Example 7.23

17. Treasury Bill Rates (TBR)

Topic. Treasury Bill rates in the USA for three different maturities.

Type of data. Time series, monthly data over the period 1960-99 (480
observations), 3 variables.

Source. Federal Reserve Board of Governors (data obtained from the database
Economagic at the Internet site www.economagic.com in 2001). [57]

Variable   Meaning
T_3M       3-month Treasury Bill rate (measured in percentages)
T_1Y       1-year Treasury Bill rate (measured in percentages)
T_10Y      10-year Treasury Bill rate (measured in percentages)

Datafile        Used in
XM728TBR [58]   Example 7.28

[57] The 3-month Treasury Bill rate is the secondary market series; the 1-year and
10-year Treasury Bill rates are the constant maturity series.

[58] The file also contains the three spreads between the Treasury Bill rates,
that is, the three differences DIFF_T10YT1Y = T_10Y - T_1Y,
DIFF_T10YT3M = T_10Y - T_3M, and DIFF_T1YT3M = T_1Y - T_3M.
18. Car Production (CAR)

Topic. Production of Japanese passenger cars.

Type of data. Time series, monthly data over the period 1980.1-2001.3 (255
observations), 2 variables.

Source. DataStream (data obtained from this database in 2001).

Variable    Meaning
TOYOTA      production volume of passenger cars by Toyota (number of cars)
JPOUTPUT    total volume of produced passenger cars in Japan (number of cars)

Datafile        Used in
XR717CAR [59]   Exercise 7.17

[59] The file also contains the variable ALLMINTOY = JPOUTPUT - TOYOTA. In
addition it contains production volumes of passenger cars of eight other
manufacturers (Daihatsu, Fuji, Honda, Isuzu, Mazda, Mitsubishi, Nissan, and
Suzuki) that are not used in the exercise but that can be analysed in the same
way as the series of Toyota.
19. Nuclear Energy Production (NEP)

Topic. Generation of electricity from nuclear electric power in the USA.

Type of data. Time series, monthly data over the period 1973.1-1999.11 (323
observations), 1 variable.

Source. Department of Energy (data obtained from the database Economagic at
the Internet site www.economagic.com in 2000).

Variable   Meaning
NUCEP      net generation of electricity from nuclear electric power (in Tbtu)

Datafile        Used in
XR718NEP [60]   Exercise 7.18

[60] The file also contains some additional energy production series (of
petroleum, of natural gas, and of electricity generated by geothermal energy and
by hydropower, as well as a total energy production series for the USA). These
series are not used in the exercise but they can be analysed in the same way as
the series of nuclear electric power.
20. Gross National Product (GNP)

Topic. Gross national product of four countries.

Type of data. Time series, yearly data over the period 1870-1993 (124
observations), 4 variables.

Source. A. Maddison, Monitoring the World Economy 1820-1992 (OECD, 1995)
(the data are taken from table C.16 on pp. 180-3).

Variable     Meaning
GERMANY      real gross domestic product of Germany (in millions of
             Geary-Khamis dollars)
JAPAN [61]   real gross domestic product of Japan (same units)
UK           real gross domestic product of UK (same units)
USA          real gross domestic product of USA (same units)

Datafile        Used in
XR720GNP [62]   Exercise 7.20

[61] The series for Japan has missing values for 1871-84, leaving 110
observations.

[62] The file also contains the logarithms of the four GNP series, denoted by
LOGGER, LOGJAP, LOGUK, and LOGUSA.
21. Exchange Rates (EXR)

Topic. Exchange rates and price indices of Germany and the UK.

Type of data. Time series, monthly data over the period 1957.1-1998.4 (496
observations; the exercise uses only the 240 observations of the years 1975-94),
4 variables.

Source. International Financial Statistics (data obtained from the database
DataStream in 2000).

Variable   Meaning
X_UK       nominal exchange rate of British Pound to 1 US dollar
X_G        nominal exchange rate of Deutsche Mark to 1 US dollar
P_UK       consumer price index for the UK (indexed so that the average
           over 1990 is equal to 100)
P_G        consumer price index for (Western) Germany (indexed in the
           same way)

Datafile        Used in
XR723EXR [63]   Exercise 7.23

[63] The file also contains the logarithms of the four series (denoted by
LOG_X_UK, LOG_X_G, LOG_P_UK, and LOG_P_G), as well as nominal exchange rates and
consumer price indices for Canada, France, Japan, and the Netherlands, together
with the consumer price index of the USA.
22. Standard and Poor Index (STP)

Topic. Standard and Poor composite stock price index.

Type of data. Time series, yearly data 1871-1987 (117 observations), 2 variables.

Source. R. J. Shiller, Market Volatility (MIT Press, 1989) (the data are taken from
tables 26.1 and 26.2 on pp. 440-3).

Variable   Meaning
REALSP     stock price of Standard and Poor index (January, in real terms)
REALDIV    yearly dividends on Standard and Poor index (in real terms)

Datafile        Used in
XR724STP [64]   Exercise 7.24

[64] The file contains several other variables, in particular SP (nominal stock
price), DIV (nominal yearly dividends), and PP (producer price index). The real
variables are defined by REALSP = SP/PP and REALDIV = DIV/PP. The file also
contains the series of earnings (EAR), interest rate (INT), real consumption
(RC), and consumer price index (CPI).
23. Market for Oranges (MOR)

Topic. Relation between price of oranges and quantity traded in the USA.

Type of data. Time series, yearly data 1910-59 (50 observations), 5 variables.

Source. M. Nerlove and F. V. Waugh, ‘Advertising without Supply Control: Some
Implications of a Study of the Advertising of Oranges’, Journal of Farm
Economics, 43 (1961), 813-37 (data obtained from table 1 on p. 827). See also
E. R. Berndt, The Practice of Econometrics: Classic and Contemporary
(Addison-Wesley, 1991), 417-20.

Variable      Meaning
QTY           quantity of oranges sold (number of boxes per capita)
PRICE [65]    real price of a box of oranges (year average, in dollars)
INC           real disposable income per capita (in dollars)
CURADV [66]   current year real advertising expenditures (in cents per capita)
AVEADV        average real advertising expenditures (in cents per capita,
              averaged over the ten preceding years)

Datafile        Used in
XR726MOR [67]   Exercise 7.26

[65] This variable is defined by PRICE = REV/QTY, where REV is the per capita real
revenue from sales of oranges (in dollars).

[66] The variables CURADV and AVEADV concern advertising expenditures for oranges
by Sunkist Growers and the Florida Citrus Commission.

[67] This file contains some additional variables, namely REV (per capita real
revenue from sales of oranges), POP (population of the USA, in millions), and CPI
(consumer price index used to produce the real series). Further it contains the
logarithmic variables LOGQT (the log of QTY), LOGPT (the log of PRICE), LOGRIT
(the log of INC), LOGAC (the log of CURADV), and LOGAP (the log of AVEADV).
Index


2SLS (see two-stage least squares) 400, 705
3SLS (see three-stage least squares) 706
absence of serial correlation 93
accuracy of least squares
    multiple regression 152-60
    simple regression 87-98
ACF (see autocorrelation function) 546
adaptive expectations 639
adding or deleting variables 134-51
additive heteroskedasticity 327
additive outlier 612
additive seasonal 604
ADF (see augmented Dickey-Fuller test) 599
adjusted R-squared 130-1
adjustment coefficient 668
ADL (see autoregressive distributed lag) 637
AIC (see Akaike information criterion) 279, 565
airline model 606
Akaike information criterion (AIC) 279, 565
alternative hypothesis 55
approximate finite sample distribution
    ML 228
    OLS 197
AR (see autoregressive model) 362, 539
ARCH (see autoregressive conditional heteroskedasticity) 621-2
ARIMA (see autoregressive integrated moving average) 580
ARMA (see autoregressive moving average model) 544, 546
assumptions
    regression model 92-3, 125-6
    simple regression model 92-4
    weaker, in regression 103-4
asymptotic analysis 188
    regression model 188-201
asymptotic approximation 197
asymptotic covariance matrix 229, 256
asymptotic distribution 50
    ML 228
    normal 50, 207
    of b 196-7
    OLS 197
asymptotic normality (OLS) 196-8
asymptotic properties 48, 193
    conditions, NLS 207
    estimation methods 47-54
    maximum likelihood 51-3, 228-30
asymptotically efficient 228
asymptotically normal 207, 228
augmented Dickey-Fuller test (ADF) 599
autocorrelation 536, 545-8
    Breusch-Godfrey LM-test 569
    Ljung-Box test 568
    partial 547
    sample 548
autocorrelation coefficient 361
autocorrelation function (ACF) 546
    partial 547
    sample 548, 564
autoregressive conditional heteroskedasticity (ARCH) 621-2
    generalized 620-3, 626-9
    LM-test 628
autoregressive distributed lag (ADL) 637
autoregressive disturbances (in regression model) 369
autoregressive integrated moving average (ARIMA) 580
autoregressive model (AR) 362, 539
    smooth transition 617
    stationarity condition 539
    stationary 539
    threshold 617
    time series 538-42, 558-9
    with distributed lags 637-47
autoregressive moving average model (ARMA) 542-6
    ARMA-GARCH 623
    diagnostic tests 567-71
    estimation 560-2
    explanatory variables 637
    identification 556, 563
    stationary 544
autoregressive moving average process (implied by VAR) 658-9
auxiliary regression 140, 216
    LM-test 215-6, 239

backward elimination 281
bandwidth (Newey-West) 360
bandwidth span 291
bank wages (data set 2) 750
baseline hazard 514
Bayes information criterion (BIC) 279
Bernoulli distribution 29
best linear unbiased estimator (BLUE) 97-8, 127
BHHH (see method of Berndt, Hall, Hall, and Hausman) 226
bias 43, 278
    and efficiency, trade-off 145, 277-8
    censoring, OLS 492-3
    correction term 495
    omitted variable 143
    selection, OLS 502
    treatment, OLS 504
    truncation, OLS 486
BIC (see Bayes information criterion) 279
bilinear process 714
binary response 438-62
    grouped data 459-61
    latent variable 441
    marginal effect 440
    model 438-43
    parameter restriction 441
    summary 461-2
    utility 442
binary variable 438
binomial distribution 29
bisquare function (robust estimation) 430
BLUE (see best linear unbiased estimator) 97-8, 127
bootstrap 65
bottom-up approach 281
Box-Cox transformation 297
Box-Pierce test 364
breakpoint 315
break test (Chow) 315
Breusch-Godfrey test 364, 569
    computational scheme 364
Breusch-Pagan test 345
    computational scheme 345
BWA (see bank wages, data set 2) 750

calculation  rules  for  matrices  728 
canonical  correlation  coefficient  670 
capital  asset  pricing  model  (CAPM)  91 
CAPM  (see  capital  asset  pricing  model)  91 
CAR  (see  car  production,  data  set  18)  766 
car  production  (data  set  18)  766 
Cauchy  distribution  33 

CDF  (see  cumulative  distribution  function)  20 
censored  data  490-500 
censored  distribution  492 
normal  491,  493 
censored  variable  490 
censoring  bias  (OLS)  492-3 
Census  X-12  605 
central  limit  theorem  50 
generalized  51 
centred  moment  22 
ceteris  paribus  140,  274 
changing  variance  (time  series)  621 
chi-square  distribution  32 
Chow  break  test  315 
Chow  forecast  test  173-4,  316 
CL  (see  conditional  logit  model)  466-70 
classification  table  453 
clustered  volatility  621 
Cochrane-Orcutt  method  369 
coefficient  of  determination  83,  129-31 
COF  (see  coffee  sales,  data  set  4)  752 
coffee  sales  (data  set  4)  752 
cointegrated  time  series  652 
cointegration  652,  667-74 
analysis  667-80 
VAR  model  667-74 
cointegration  relation  668 
number  671 

vector  error  correction  model  668-9 
cointegration  test  (Johansen)  671-3 
critical  values  672 
column  vector  726 
common  trend  652 

comparison  of  tests  (F,  LM,  LR,  W)  240-2 
computer  science  1-2 
concentration  231,  743 
conditional  distribution  24 
conditional  expectation  24 
conditional  heteroskedasticity  621-35 
autoregressive  621-2 
generalized  autoregressive  620-3,  626-9 
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conditional  interpretation  (stochastic 
regressors)  438 

conditional  logit  model  (CL)  466-70 
marginal  effect  468 
conditional  model  464 
conditional  prediction  107 
conditional  probit  model  465 
conditional  variance  24-5,  621-3 
consistency  (OLS)  193-6 
consistent  194,  228 
estimator  48 
test  56 

constant  coefficients  (test)  315 
constant  DGP  (test)  171-2 
constant  term  120 
contemporaneous  correlation  684 
continuous  function  739-40 
continuous  random  variable  20 
controlled  experiment  92 
convergence  in  distribution  50 
convergence  in  probability  48 
correlation 

contemporaneous  684 
matrix  18 
nonsense  647 
serial  354-77,  637-55 
correlation  coefficient  23 
canonical  670 
sample  18 

correlogram  361,  563 
covariance  23 
sample  18 

covariance  matrix  126 
asymptotic  229,  256 
robust  258 
covariate  79 
Cramer-Rao  bound  43 
critical  region  56 
critical  values 

cointegration  tests  672 
unit  root  tests  595 

cross  section  data  749-50,  752,  755,  759-62 
cumulative  distribution  function  (CDF)  20 
sample  13,  20 

cumulative  sum  of  squares  test  (CUSUMSQ)  314

cumulative  sum  test  (CUSUM)  313 
curse  of  dimensionality  662 
CUSUM  test  (cumulative  sum)  313 
CUSUMSQ  test  (cumulative  sum  of  squares)  314, 
343 


data 

deviation  from  sample  mean  147-8 
graphs  12-15 
matrix  725 

transformation  296-301 
data  generating  process  (DGP)  42,  87 
test  on  constancy  171-2 
data  sets 
list  748 

overview  and  description  of  variables  747-71 
decomposition  of  time  series  604 
degrees  of  freedom  129 
deleting  or  adding  variables  134-51 
density 

function  20 
logistic  443 
standard  normal  443 
symmetric  441-2 
truncated  485 

dependence  (random  variables)  17 
dependent  variable  79 
lagged  637-42 
qualitative  438-81 
derivative  739 
partial  740 

descriptive  statistics  12-19 
determinant  (matrix)  732 
determination  (coefficient  of)  83,  129-31 
deterministic  seasonal  605 
deterministic  trend  578 
detrending  148 

deviations  from  sample  mean  147-8 
dfbetas  383 
dffits  383 

DGP  (see  data  generating  process)  42,  87
diagnostic  tests  275 
logit  and  probit  452-7 
time  series  567-71 
time  series,  summary  571 
diagnostic  tests  and  model  adjustments 
(summary,  further  reading,  keywords) 
424-6 

diagonal  matrix  726 
Dickey-Fuller 
augmented  599 
critical  values  595 
distribution  594 
F-test  593-6 
t-test  594-6 

difference  operator  297,  580 
difference  stationary  580
differencing  (stochastic  trends)  651-2
over-differencing  607 
differentiable  function  739-40 
direct  effect  140 

direct  marketing  for  financial  product  (data  set 
13)  761 

discrete  choice  438-81 
discrete  random  variable  20 
distributed  lag  637 
distribution  (probability)  20 
asymptotic,  OLS  197 
asymptotic  normal  50,  207 
Bernoulli  29 
binomial  29 
Cauchy  33 
censored  492 
censored  normal  491,  493
chi-square  32 
conditional  24 
exponential  516 
extreme  value  466 
F  34 

finite  sample,  ML  228 
finite  sample,  OLS  197 
joint  23 

large  sample  206,  228-30 
log-normal  513 
marginal  23 

mixed  continuous-discrete  491 
multivariate  normal  30 
non-standard  594 
normal  29,  31 
standard  normal  30 
Student  t  32 
t  32 

truncated  normal  486-7 
Weibull  513 
disturbance  88,  93,  125 
assumptions  93,  125 
distribution  378-95 
distribution,  summary  394-5 
variance,  estimate  127-9 
DJI  (see  Dow-Jones  index,  data  set  15)  763 
DMF  (see  direct  marketing  for  financial  product, 
data  set  13)  761 

Dow-Jones  index  (data  set  15)  763 
drift  term  (random  walk)  581 
dummy  variable  303 
observation  379 
seasonal  303,  605 
use  303-10 
duration  511 


duration  model  511-16 

duration  of  strikes  (data  set  14)  762 

Durbin-Watson  test  362

Durbin-Wu-Hausman  test  410 

DUS  (see  duration  of  strikes,  data  set  14)  762 

dynamic  forecast  570 

dynamic  models  (summary,  further  reading, 
keywords)  710-12 

dynamic  simultaneous  equation  model  706-7 

ECM  (see  error  correction  model)  639-40 
econometric  modelling  1-3,  87,  274-6 
economics  (modelling)  1-3,  274-6 
efficiency  (simple  regression)  97-8 
efficient  43 

asymptotically  228 
eigenvalue  (matrix)  734 
eigenvalue  decomposition  734-5 
eigenvector  734 
elasticity  296 
constant  203 
varying  203-4 
empirical  cycle  276 
endogeneity  396-7 
endogenous  regressors  396-418 
summary  418 
endogenous  variable  79 
error  correction  model  (ECM)  639-40 
error  of  the  first  type  56 
error  of  the  second  type  56 
error  term  88 
estimate  38 
interval  64 

interval,  in  regression  100 
point  63-4 
significance  99 
estimation  methods  38-42 
asymptotic  properties  47-54 
comparison  41,  222 
non-parametric  289-95 
statistical  properties  42-7 
estimation  sample  170,  280 
estimator  38 

evaluation  (model)  274-6 
EWMA  (see  exponentially  weighted  moving 
average)  588 

exactly  identified  parameter  253 
exchange  rates  (data  set  21)  769 
exclusion  restriction  705 
exogeneity  409 
test  409-12 
exogenous  194
exogenous  variable  79,  398
expectation  21 
conditional  24 
experiment  (controlled)  92 
experimental  data  140 
explained  sum  of  squares  (SSE)  83,  130 
explained  variable  79 
explanatory  variables  79 
choice  277-85 
stability  condition  193 
summary  302 

exponential  distribution  516 
exponential  hazard  model  513 
exponential  smoothing  588 
exponentially  weighted  moving  average 
(EWMA)  588 

EXR  (see  exchange  rates,  data  set  21)  769 
extreme  value  distribution  466 

FAS  (see  fashion  sales,  data  set  8)  756 
fashion  sales  (data  set  8)  756 
fat  tails  223-4 
F-distribution  34 

feasible  generalized  least  squares  (FGLS)  686 
feasible  GLS  (FGLS)  686 
feasible  weighted  least  squares  (FWLS)  335 
approximate  distribution  336 
computational  scheme  335 
grouped  binary  response  460 
FEX  (see  food  expenditure,  data  set  7)  755 
FGLS  (see  feasible  generalized  least  squares)  686 
finite  sample  distribution 
approximate,  ML  228 
approximate,  OLS  197 
finite  sample  properties  197,  228 
first  difference  297 

first  order  autocorrelation  coefficient  361 

first  order  condition  739 

fitted  value  80 

fixed  effects  (panels)  693 

fixed  regressors  92,  125 

food  expenditure  (data  set  7)  755 

forecast 

dynamic  570 
multi-step-ahead  550 
one-step-ahead  550 
static  570 
see  also  prediction 

forecast  performance  (time  series)  569-70 
forecast  test  (Chow)  173-4,  316 
forecasting 
ADL  642 


dynamic  570 
Holt-Winters  589 
static  570 

stationary  time  series  550-3 
time  series  with  trends  585-9 
forward  selection  281 
Frisch-Waugh  146-7 
F-test  161-6 
basic  form  162 
geometric  interpretation  163 
IV  estimation  406-7 
with  R-squared  163 
full  rank  assumption  122,  125 
functional  form 
non-linear  285-9 
summary  302 

FWLS  (see  feasible  weighted  least  squares)  335 

GARCH  (see  generalized  ARCH)  620-3,  626-9 
Gauss-Markov  theorem  98,  127 
Gauss-Newton  method  211-12 
generalized  ARCH  (GARCH)  620-3,  626-9 
generalized  autoregressive  conditional 

heteroskedasticity  (GARCH)  620-3,  626-9 
diagnostic  tests  627-9 
estimation  626-7 
use  in  risk  modelling  629 
generalized  least  squares  (GLS)  685 
feasible  686 
two-step  feasible  686 

generalized  method  of  moments  (GMM)  250-65 
computational  scheme  254,  258-9 
estimator  258-9,  325 
motivation  250-1 
simple  regression  260-2 
standard  error  255-9 
weighting  matrix  254,  256,  259 
generalized  residuals  516 
general-to-specific  method  281 
geometric  interpretation 
F-test  163 
OLS  123-5 
global  maximum  739 
global  minimum  739 
GLS  (see  generalized  least  squares)  685 
GMM  (see  generalized  method  of 
moments)  250-65 

GNP  (see  gross  national  product,  data  set  20)  768 
Goldfeld-Quandt  test  343 
goodness  of  fit  453 
gradient  740 

outer  product  226
Granger  causality  test  663
gross  national  product  (data  set  20)  768 
grouped  data  459 
binary  choice  459-61 
heteroskedasticity  328-9 
growth  rate  297 

HAC  (see  heteroskedasticity  and  autocorrelation
consistent)  360 
hat  matrix  123 
Hausman  test  410 

computational  scheme  411 
hazard 

baseline  514 
proportional  514 
hazard  rate  512 
exponential  513 
Heckman  two-step  method  495 
computational  scheme  495,  503-4 
Hessian  matrix  741 
heteroskedasticity  93,  320-53 
additive  327 
conditional  621-35 
consequences  for  OLS  324 
estimation  by  ML  and  FWLS  334-7 
GMM  261-2 
grouped  data  328-9 
LM-test,  computational  scheme  457 
logit  and  probit  455 
multiplicative  327 
summary  352-3 

heteroskedasticity  and  autocorrelation  consistent 
(HAC)  360 
histogram  12 
hit  rate  453 
hold-out  sample  276 
Holt-Winters  forecast  589 
homoskedastic  93,  125 
homoskedasticity  (tests)  343-6 
Huber  criterion  391 
hypothesis  55 
alternative  55 
null  55 

hypothesis  test  55-67 

IBR  (see  interest  and  bond  rates,  data  set  9)  757 
idempotent  matrix  737 

identically  and  independently  distributed  (IID)  39 
identification  (ARMA  model)  556,  563 
identification  restriction  (SEM)  705 
identified  parameter  206 
exactly  identified  253 


over-identified  253 
identity  matrix  726 
IID  (see  identically  and  independently 
distributed)  39 

independence  (random  variables)  25 
conditions  34-5 

independence  of  irrelevant  alternatives  469 
independent  variable  79 
qualitative  303-4 
index  function  441 
indirect  effect  140 

industrial  production  (data  set  10)  758 
inefficient  144 
influential  data  379 
influential  observation  378,  384 
information  criteria  279 
information  matrix  45,  228 
alternative  expressions  243 
initial  values  560-1,  661 
innovation  outlier  613 
innovation  process  537 

INP  (see  industrial  production,  data  set  10)  758 
instrument  398 
validity  412-14 
weak  405 

instrumental  variable  (IV)  398 
estimation  396-404 
estimation,  statistical  properties  404-9 
estimator  399 
motivation  396 
summary  418 
integrated  process  580 
integration  order  580-1 
interaction  term  286 
intercept  79 

R-squared  in  model  without  84 
interest  and  bond  rates  (data  set  9)  757 
interval  estimate  64 
regression  100 
invariant  40 
inverse  matrix  730 
computation  733-4 
inverse  Mills  ratio  486 
invertibility  condition  543 
invertible  (MA  process)  543-4 
invertible  matrix  730 

irrelevant  alternatives  (independence  of)  469 
iterated  FGLS  687 
iterated  FWLS  337 

iterative  optimization  (computational  scheme) 
209 

IV  (see  instrumental  variable)  398

Jacobian  matrix  27
Jarque-Bera  test  387 
Johansen  trace  test  670-3 
joint  distribution  23 
joint  random  variables  23-7 
joint  significance  164 
J-test  258 

kernel  360 
kernel  function  292 
kernel  method  292 
kurtosis  16,  386 

LAD  (see  least  absolute  deviation)  390
lag  operator  539 

lagged  dependent  variable  637-42 
lagged  variable  (regression  model)  368 
Lagrange  method  213,  743-4 
Lagrange  multiplier  212-18,  744 
interpretation  744 

Lagrange  multiplier  test  (LM)  214-18,  235-8, 
240-2 

auxiliary  regression  215 
computational  scheme  215,  218,  240 
linear  model  213,  238-40 
relation  with  F-test  216
large  sample  approximation  229 
large  sample  distribution  206,  228-30 
large  sample  standard  error  228-30,  448 
latent  trend  583 

latent  variable  (binary  response)  441 
law  of  large  numbers  50 
least  absolute  deviation  390 
least  squares  39 

accuracy,  multiple  regression  152-60 

accuracy,  simple  regression  87-98 

computational  scheme  122 

criterion  80,  121 

disadvantages  222,  381 

estimator  95-6,  121-2 

generalized  685 

geometry  124 

matrix  form  118-33 

multiple  regression  118-51

non-linear  205-8 

ordinary  80,  121-2 

recursive  310-13 

restricted  181-2 

seen  as  projection  123-4 

simple  regression  76-87 

terminology  79 

three-stage  706 


two-stage  400,  705 
unbiased  95,  126 
variance  96,  126 
weighted  290,  327-30 
leave-one-out  method  (outliers)  380 
level  shift  614 
level  variable  297 
leverage  379 

likelihood  function  40,  225 
likelihood  ratio  test  (LR)  230-2,  240-2 
limited  dependent  variables  482-522 
summary  521-2 

summary,  further  reading,  keywords  523-4 
linear  dependence  733 
linear  model  93 
linear  probability  model  439 
linear  restriction  165 
linearization  210-12 
Ljung-Box  test  365,  568 
LM-test  (see  Lagrange  multiplier  test)  214-18,
235-8,  240-2 
local  maximum  739 
local  minimum  739 
local  regression  289 

computational  scheme  292-3 
logarithmic  transformation  296 
logistic  density  443 
logit  and  probit  model  443-6 
comparison  444 
logit  model  444 
conditional  466-70 
diagnostic  tests  452-7 
estimation  and  evaluation  447-50 
heteroskedasticity  455 
marginal  effect  445 
multinomial  466-70 
scaling  445 
log-likelihood  52,  225 
log-linear  model  296 
log-normal  distribution  513 
log-odds  446,  469 
longitudinal  data  692-7 
long-run  multiplier  637 

LR-test  (see  likelihood  ratio  test)  230-2,  240-2 

MA  (see  moving  average)  542-3,  547,  560,  564 
MAE  (see  mean  absolute  error)  280 
marginal  distribution  23 
marginal  effect 
binary  choice  440 

dummy  variable  in  binary  choice  445-6 
logit  and  probit  445

multinomial  and  conditional  logit  468 
non-constant  286 
ordered  response  475-6 
tobit  493 

truncated  model  487 
market  for  oranges  (data  set  23)  771 
mathematics  1-2 
matrix 

addition  727 
calculation  rules  727-8 
data  725 

multiplication  727 
notation  726 

notation  in  econometrics  120,  726 
regression  in  matrix  form  120 
matrix  methods  (overview)  723-46 
maximum  739,  741 
global  739 
local  739 

maximum  likelihood  (ML)  40,  222-49 
asymptotic  properties  51-3,  228-30
computational  scheme  230 
linear  model  227 
motivation  222-4 
quasi  259-60 

McFadden’s  R-squared  453 
mean  21 
sample  16 

mean  absolute  error  (MAE)  280 
mean  absolute  prediction  error  570 
mean  reverting  process  579 
mean  squared  error  (MSE)  43 
measurement  error  191,  268 
median  16 

method  of  Berndt,  Hall,  Hall,  and  Hausman 
(BHHH)  226 

method  of  moments  39,  252 
generalized  250-65 

MGC  (see  motor  gasoline  consumption,  data  set
6)  754 

minimal  variance  127 
minimum  739,  741 
global  739 
local  739 

misspecification  test  275,  286 
mixed  continuous-discrete  distribution  491 
ML  (see  maximum  likelihood)  40,  222-49 
MNL  (see  multinomial  logit  model)  466-70
model  38 

model  adjustment  (serial  correlation)  368-70 


model  adjustments  and  diagnostic  tests  (summary, 
further  reading,  keywords)  424-6 
model  evaluation  274-6 
model  identification  556 
model  selection  274-6 
time  series  563-5 
time  series,  summary  576-7 
VAR  model  661-2 
modelling  274 

multiple  time  series  with  trends  673-4 
time  series  555 

MOM  (see  mortality  and  marriages,  data  set 
16)  764 
moment  22 
centred  22 

generalized  method  of  moments  250-65 
method  of  moments  39,  252 
non-centred  516 
sample,  centred  16 
sample,  uncentred  516 
moment  conditions  253 
test  258 

MOR  (see  market  for  oranges,  data  set  23)  771 
mortality  and  marriages  (data  set  16)  764 
motor  gasoline  consumption  (data  set  6)  754 
moving  average  (MA)  542-3,  547,  560,  564 
invertible  543-4 

MSE  (see  mean  squared  error)  43 
multicollinearity  158-9 
multinomial  data  463-81 
summary  480-1 

multinomial  logit  model  (MNL)  466-70 
marginal  effect  468 
multinomial  model  464 
multinomial  probit  model  465 
parameter  restriction  466 
multiple  equation  models  682-709 
summary  709 
multiple  regression  118-51
assumptions  125-6 
efficiency  127 
F-test  161-6 

geometric  interpretation  123-5 

interpretation  140-1,  147 

prediction  169-74 

statistical  properties  125-7 

summary,  further  reading,  keywords  178-9 

t-test  152-4,  164 

multiplicative  heteroskedasticity  327 
multiplicative  model  296 
multiplicative  seasonal  604
multiplier

Lagrange  212-18,  744 
long-run  637 
short-run  637 

multi-step-ahead  forecast  550 
multivariate  normal  distribution  30 
multivariate  time  series  models  656-709 
error  correction  660 
fixed  effects  panel  692-4 
random  effects  panel  695-7 
seemingly  unrelated  regression  684-8 
simultaneous  equations  700-9 
stationary  vector  autoregression  656-66 
trends  and  cointegration  667-81 

nearest  neighbour  fit  290 
negative  definite  matrix  736 
negative  semidefinite  matrix  736 
neglected  dynamics  354 
NEP  (see  nuclear  energy  production,  data  set 
19)  767 

Newey-West  standard  error  360 
Newton-Raphson  method  210,  226 
NID  (see  normally  and  independently 
distributed)  35 

NLS  (see  non-linear  least  squares)  206 
nominal  variable  463 
non-centred  moment  516 
non-centred  R-squared  456 
non-experimental  data  140 
non-invertible  matrix  733 
non-linear  functional  form  285-9 
non-linear  least  squares  (NLS)  205-8 
computational  scheme  208 
properties  206-8 

non-linear  methods  (summary,  further  reading, 
keywords)  266-7 

non-linear  optimization  209-12,  226 
non-linear  regression  202-21 
non-linear  regression  model  205 
motivation  202-5 

non-linearities  of  time  series  (summary)  636 
non-parametric  estimation  289-95 
non-parametric  model  289 
nonsense  correlations  647 
non-standard  distribution  594 
non-stationary  process  578-82 
normal  distribution  29 
asymptotic  207,  228 
properties  31 
normal  equations  82,  121 


non-linear  206,  394 
normal  random  sample  35-7 
normality 

assumption  93,  126 
asymptotic,  OLS  196-8 
test  386-7 

normally  and  independently  distributed  (NID)  35

notation 

Greek  and  Latin  symbols  91 
matrix,  econometrics  120,  726 
random  variables  20-1 
nuclear  energy  production  (data  set  19)  767 
null  hypothesis  55 

number  of  cointegration  relations  (computational 
scheme)  671 

numerical  method  209-12 
numerical  optimization  209-12,  226 
numerical  precision  84 

odds  ratio  446 

OLS  (see  ordinary  least  squares)  80,  121-2 
omitted  variable  142-3 
bias  143 
one-sided  test  60 
one-step-ahead  forecast  550 
optimization  738-44 
iterative  209 
non-linear  209-12,  226 
numerical  209-12,  226 
order  condition  398,  705 
order  of  integration  580-1 
ordered 

alternatives  474 
data  310 

response  data  474-7 
variable  463 
ordered  logit  model  477 
ordered  probit  model  477 
ordered  response  model  474 
diagnostic  tests  477 
marginal  effect  475-6 
threshold  values  475 
ordinal  variable  474 
ordinary  least  squares  (OLS)  80,  121-2 
geometric  interpretation  123-5 
properties  92-8,  125-7 
properties  under  heteroskedasticity  324-5 
properties  under  serial  correlation  358-61 
orthogonality  condition  194 
outer  product  of  gradients  226
outlier  379
additive  612 
innovation  613 
leave-one-out  method  380 
limitation  OLS  381 
time  series  612-14 

out-of-sample  performance  170-4,  280,  569-71 
overall  significance  (regression)  164 
over-differencing  607 
over-identified  parameter  253 
over-identifying  restrictions  258 

PACF  (see  partial  autocorrelation  function)  547
panel  data  682,  692-7,  753,  756 
panel  model 

fixed  effects  693 
random  effects  695 
parameter  29,  38,  93,  125 
constancy  93,  125 
estimation  38-54 
estimation,  time  series  558-62 
exactly  identified  253 
identified  206 
over-identified  253 
penalty  term  279 

restriction,  binary  response  model  441 
restriction,  multinomial  probit  466 
scale  483 

stability  test  314-16 
time-varying  303,  616-18 
varying  303-19 
partial  adjustment  model  638 
partial  autocorrelation  547 
sample  548,  564 

partial  autocorrelation  function  (PACF)  547 
sample  548,  564 
partial  derivative  740 
partial  regression  145-8 
computational  scheme  146 
scatter  plot  148 
partitioned  matrix  729 
penalty  term  (number  of  parameters)  279 
Phillips-Perron  test  597-8 
piece-wise  linear  relation  304 
pivotal  (statistic)  36 
plim  (see  probability  limit)  48-9 
PMI  (see  primary  metal  industries,  data  set  5)  753 
point  estimate  63-4 
point  prediction  105 
polynomial  model  282 
pooled  data  682,  753,  756 
pooled  estimator  of  variance  182


positive  definite  matrix  736 
positive  semidefinite  matrix  736 
power  56 

practical  significance  57 
precision  of  reported  results  84 
prediction 

conditional  107 

error  537 

interval  106,  171 

multiple  regression  169-74 

out-of-sample  280 

point  105 

sample  170,  280 

simple  regression  105-10 

unconditional  107 

variance  of  errors  105-6,  170-1 

see  also  forecast 

predictive  performance  169,  280 
present  value  theory  720 
primary  metal  industries  (data  set  5)  753 
probability  distribution  20,  29-35 
probability  limit  (plim)  48 
calculation  rules  49 
probability  value  (P-value)  60,  153 
probit  and  logit  model  443-6 
comparison  444 
probit  model  444 
conditional  465 
diagnostic  tests  452-7 
estimation  and  evaluation  447-50 
heteroskedasticity  455 
marginal  effect  445 
multinomial  465 
scaling  445 

process  (see  stationary  process,  non-stationary 
process) 

product  (matrices)  727 
projection  (least  squares)  123-4 
projection  matrix  737 
properties  of  OLS  92-8,  125-7 
under  heteroskedasticity  324-5 
under  serial  correlation  358-61 
proportional  hazard  model  514 
purchasing  power  parity  720 
P-value  (see  probability  value)  60,  153 

QML  (see  quasi-maximum  likelihood)  259-60
qualitative  dependent  variables  438-81 
summary,  further  reading,  keywords  523-4 
qualitative  independent  variable  303-4 
quasi-maximum  likelihood  (QML)  259-60 
computational  scheme  260

Ramsey’s  RESET  285
random  20 

disturbance  92,  125 
effects,  panels  695 
regressors  193 
random  sample  35 
normal  35-7 
random  variable  20-37 
continuous  20 
discrete  20 
independent  25 
joint  23-7 
notation  20-1 
transformation  22,  27 
random  walk  579 
with  drift  581 
rank  (matrix)  733 
rank  condition  398 
rate  of  convergence  201 
recursive  least  squares  310-13 
recursive  residuals  311 
redundant  variable  143-5 
reference  quarter  304 
regimes  616 
regression 

auxiliary  140,  216 

auxiliary,  LM-test  215-16,  239 

local  289,  292-3 

non-linear  202-21 

overall  significance  164 

partial  145-8 

seemingly  unrelated  684-8 

significance  164 

spurious  647,  650-2 

subset  315 

summary  of  computations  154 
trending  variables  647-54 
see  also  simple  regression,  multiple  regression 
regression  coefficients  (interpretation)  139-42 
regression  diagnostics  379-84 
regression  model 

assumptions  92-3,  125-6 
asymptotic  analysis  188-201 
lagged  variables  637-55 
matrix  form  120 
restricted  and  unrestricted  135-9 
weaker  assumptions  103-4 
regression  results  (way  of  presentation)  102,  155-6 
regression  specification  error  test  (RESET)  285 
regressor  79 

conditional  interpretation  438 
endogenous  396-418 


fixed  92,  125 
random  193 
stable  193,  199-201 
stochastic  191-3 
regularization  210 
rejection  region  56 
reparametrization  225 
reported  results 
precision  84 
regression  102,  155-6 

RESET  (see  regression  specification  error  test)  285 
residuals  82-3,  121 
generalized  516 
recursive  311 
restricted  135-7 
standardized  352,  454,  628 
studentized  380 
variance  96 

restricted  least  squares  estimator  181-2 
restricted  model  135-7 
restricted  residuals  135-7 
restriction  135,  165 
risk  modelling  (GARCH)  629 
RMSE  (see  root  mean  squared  error)  280 
robust  covariance  matrix  258 
robust  estimation  388-94 
scaling  393 

robust  standard  error  258 
root  mean  squared  error  (RMSE)  280 
root  mean  squared  prediction  error  570 
rounding  error  84 
row  vector  726 
R-squared  83 
adjusted  130-1 
F-test  163 

geometric  picture  130 
McFadden’s  453 
model  without  intercept  84 
non-centred  456 
rule  of  thumb  (t-test)  100 

SACF  (see  sample  autocorrelation  function)  548, 
564 

salaries  of  top  managers  (data  set  11)  759 
sample  autocorrelation  548 
sample  autocorrelation  function  (SACF)  548,  564 
sample  correlation  coefficient  18 
sample  covariance  18
sample  cumulative  distribution  function
(SCDF)  13,  20
sample  mean  16 

deviations  from  147-8
sample  moment
centred  16 
uncentred  516 

sample  partial  autocorrelation  548,  564 
sample  partial  autocorrelation  function 
(SPACF)  548,  564
sample  selection  500-4,  525 
sample  standard  deviation  16 
sample  statistics  16-19 
sample  variance  16 
sandwich  estimator  258 
Sargan  test  413 
computational  scheme  413 
SARIMA  (see  seasonal  ARIMA)  606-7 
scale  parameter  483 
scaling  296 
logit  and  probit  445 
robust  estimation  393 
scatter  diagram  13,  77-80 
partial  regression  148 
simple  regression  76-9 
SCDF  (see  sample  cumulative  distribution 
function)  13,  20 

Schwarz  information  criterion  (SIC)  279,  565 

score  test  235 

seasonal 

additive  604 
multiplicative  604 
seasonal  adjustment  605 
seasonal  ARIMA  (SARIMA)  606-7 
seasonal  component  604 
seasonal  dummies  303,  605 
seasonality  604-7 
seasonals 

deterministic  605 
stochastic  606 
summary  611 

second  order  derivatives  (matrix)  741 
second  order  stationary  536 
seemingly  unrelated  regression  (SUR)  684-8 
selection  bias  (OLS)  502 
selection  effects  (model)  500-4 
SEM  (see  simultaneous  equation  model)  700-7 
serial  correlation  354-77,  637-55 
absence  93,  125 

causes  and  interpretation  354,  368 
consequences  for  OLS  358-9 
model  adjustment  368-70 
summary  376-7 
tests  361-5 
shadow  price  744 
short-run  multiplier  637 


SIC  (see  Schwarz  information  criterion)  279,  565 
significance  153 
estimate  99 
joint  164 
practical  57 
rule  of  thumb  100 
statistical  57 
significance  level  56 
significance  of  S(P)ACF  564 
significance  of  the  regression  164 
significance  test  (simple  regression)  99-104 
significant  60,  100 
similar  test  56 
simple  regression  76-87 
assumptions  92-4 
efficiency  97-8 
model,  examples  91-2 
model,  interpretation  94 
prediction  105-10 
scatter  diagram  13,  77-80 
significance  test  99-104 
statistical  properties  94-7 
summary,  further  reading,  keywords  111-12 
t-test  99-100 
simulation  87-9 

simulation  experiment  46,  87-91 
simulation  run  46 

simultaneous  equation  model  (SEM)  700-7 
dynamic  706-7 
single  random  variable  20-2 
singular  matrix  733 
size  56 

skewness  16,  386 
slope  coefficient  79 
t-value  100 

smooth  transition  autoregressive  model 
(STAR)  617 
switching  function  618 
smoothing  factor  588 

SMR  (see  stock  market  returns,  data  set  3)  751 
SPACF  (see  sample  partial  autocorrelation 
function)  548,  564 
span  291 

specification  error  (test)  285 
specific-to-general  approach  281 
spurious  regression  647,  650-2 
statistical  causes  650 
square  matrix  726 
square  root  (matrix)  736 
SSE  (see  explained  sum  of  squares)  83,  130 
SSR  (see  sum  of  squared  residuals)  83,  130 
SST  (see  total  sum  of  squares)  83,  130
stability  condition  (explanatory  variables)  193
stability  test  (parameters)  314-16
stable  regressor  193,  199-201
standard  and  poor  index  (data  set  22)  770 
standard  deviation  22 
sample  16 

standard  error  60,  128 
GMM  255-9 
large  sample  228-30,  448 
Newey-West  360 
of  b  100 

of  the  regression  100,  128 
robust  258 
White  325 

standard  normal  density  443 
standard  normal  distribution  30 
standardized  residuals  352,  454,  628 
STAR  (see  smooth  transition  autoregressive
model)  617 
static  forecast  570 
stationarity  condition 
AR  model  539 
VAR  model  660 
stationary  process  535-8 
AR  539 
ARMA  544 
difference  580 
second  order  stationary  536 
time  series  models  532-54 
time  series  models,  summary  553-4 
stationary  trend  582 
stationary  vector  autoregression  656-60 
statistic  38 

statistical  properties  (estimation  methods)  42-7 
maximum  likelihood  51-3,  228-30
multiple  regression  125-7 
simple  regression  94-7 
statistical  significance  57 
statistics  (summary,  further  reading, 
keywords)  68-70 
stochastic  regressor  191-3 
conditional  interpretation  438 
stochastic  seasonal  606 
stochastic  trend  579 
differencing  651-2 
stock  market  returns  (data  set  3)  751 
STP  (see  standard  and  poor  index,  data  set  22)  770 
structural  change  313-16,  616-17 
STU  (see  student  learning,  data  set  1)  749 
student  learning  (data  set  1)  749 
Student  t-distribution  32 
studentized  residuals  380 


submatrix  729-30 
subset  regression  315 
sum  (matrices)  727 
sum  notation  723 

sum  of  squared  residuals  (SSR)  83,  130 
summation  operator  723 

SUR  (see  seemingly  unrelated  regression)  684-8 
survival  function  512
switching  function  (STAR  model)  618 
symmetric  density  function  441-2 
symmetric  matrix  729 

TAR  (see  threshold  autoregressive  model)  617 
TBR  (see  treasury  bill  rates,  data  set  17)  765 
t-distribution  32 
test 

comparison  of  F,  LM,  LR,  W  240-2 
diagnostic  275,  424-6 
distribution,  chi-square  or  F  242 
hypothesis  55-67 
mean  and  variance  59-63 
one-sided  60 

set  of  linear  restrictions  166 
significance  99-104 
similar  56 
two-sided  59 
test  statistic  56 
Theil  criterion  181 
three-stage  least  squares  (3SLS)  706 
computational  scheme  706 
threshold  autoregressive  model  (TAR)  617 
threshold  values  (ordered  response  model)  475 
time  series 

cointegrated  652 
decomposition  604 
diagnostic  tests  567-71 
forecast  evaluation  569-70 
forecasting  550-3 
integrated  580 

model  selection  563-5,  576-7 
modelling  555 
non-linearities  636 
outlier  612-14 
stationary  532-54 

summary,  further  reading,  keywords  710-12 
varying  parameters  616-18 
time  series  data  751,  754,  756-8,  763-71 
time  series  modelling  (ARMA)  555 
computational  scheme  556 
time-varying  parameters  303,  616-18 
time-varying  volatility  620-9 
summary  636
TMSP  (see  total  mean  squared  prediction
error)  278
tobit  estimate  494 
tobit  model  490 
marginal  effect  493 
type  1  501 
type  2  501 

TOP  (see  salaries  of  top  managers,  data  set  11)  759
top-down  approach  281 
total  effect  140 

total  mean  squared  prediction  error  (TMSP)  278 
total  sum  of  squares  (SST)  83,  130 
trace  (matrix)  730 
trace  test  (Johansen)  670-3 
transformation 
data  296-301 
logarithmic  296 
random  variable  22,  27 
transformed  model  327 
transpose  (matrix)  729 
treasury  bill  rates  (data  set  17)  765 
treatment  (overall  effect)  506 
treatment  bias  (OLS)  504 
treatment  effect  504-6 
trend  297,  578-604 
cointegration  667-80 
common  652 
deterministic  578 
deterministic  or  stochastic  607 
latent  583 
stochastic  579 
summary  611 
trend  forecasting  585-9 
trend  models  578-85 
trend  stationary  582 
trending  variables  (regression)  647-54 
tricube  weighting  function  291 
true  model  87-8,  142 
truncated  density  485 
truncated  normal  distribution  486-7 
truncated  sample  482-90 
marginal  effect  487 
truncation  bias  (OLS)  486 
t-test  153 
mean  60 

multiple  regression  152-4,  164 
relation  with  Wald  test  235 
rule  of  thumb  100 
simple  regression  99-100 
t-value  37,  153 
slope  coefficient  100 
two-sided  test  59 


two-stage  least  squares  (2SLS)  400,  705 
computational  scheme  400,  705 
two-step  feasible  generalized  least  squares  (computational  scheme)  686 
two-step  feasible  weighted  least  squares  (computational  scheme)  335 
type  I  error  56 
type  II  error  56 

unbiased  43,  95,  126 
best  linear  estimator  97-8,  127 
least  squares  95,  126 
unconditional  prediction  107 
uncontrolled  (variable)  140 
uncorrelated  24 
unit  root  580 
unit  root  test  592-600 
choice  of  test  equation  596-7 
critical  values  595 
unordered  response  data  463-6 
unordered  variable  463 
unrestricted  model  135-7 
US  presidential  elections  (data  set  12)  760 
USP  (see  US  presidential  elections,  data  set  12)  760 
utility 

binary  choice  442 
stochastic  463 

validity  of  instruments  413 
test  412-14 

VAR  (see  vector  autoregressive  model)  656-81 
variable 

data  sets,  overview  747-71 
dependent  79 
endogenous  79 
exogenous  79,  398 
explained  79 

explanatory  79,  277-85,  302 
independent  79 
instrumental  396,  398 
lagged  dependent  637-42 
limited  dependent  482-522 
omitted  142-3 
ordered  463 
ordinal  474 

qualitative  dependent  438-81 
qualitative  independent  303-4 
random  20-37 
to  be  explained  79 
trending  647-54 
uncontrolled  140 
unordered  463 
variable  addition  test  240 
variance  21 

changing,  time  series  621 
conditional  24-5,  621-3 
least  squares  96,  126 
minimal  127 
pooled  estimator  182 
prediction  errors  105-6,  170-1 
residuals  96 
sample  16 

variance  inflation  factor  159 
variance  reduction  143,  25 
varying  parameters  303-19 
summary  318-19 
tests  313-18 
time  series  616-18 

VECM  (see  vector  error  correction  model)  660 
vector  726 

vector  autoregressive  model  (VAR)  656-81 
estimation  and  diagnostic  tests  661-3 
estimation  with  cointegration  669-70 
implied  univariate  ARMA  658-9 
model  selection  661-2 
stationarity  condition  660 
summary  681 


vector  error  correction  model  (VECM)  660 
cointegration  relation  668-9 
estimation  with  cointegration  669-73 
volatility  620-35 
clustered  621 
time-varying  620-9 

Wald  test  (W)  232-5,  240-2 
relation  with  t-test  235 
weak  instrument  405 
Weibull  distribution  513 
weighted  least  squares  (WLS)  290,  327-30 
asymptotic  properties  330 
computational  scheme  330 
feasible  335 
two-step  feasible  335 
weighting  matrix  (GMM)  254,  256,  259 
white  noise  537 
White  standard  error  325 
White  test  345 

WLS  (see  weighted  least  squares)  290,  327-30 

W-test  (see  Wald  test)  232-5,  240-2 
zero  matrix  726 


